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Chapter  1 
Introduction 


Matrix  multiplication  is  one  of  the  most  fundamental  algorithmic  problems  in  numerical 
linear  algebra,  distributed  computing,  scientific  computing,  and  high-performance  computing. 
Parallelization  of  matrix  multiplication  has  been  extensively  studied  (e.g.,  [21,  12,  24,  2,  51, 
39,  36,  23,  45,  61]).  It  has  been  addressed  using  many  theoretical  approaches,  algorithmic 
tools,  and  software  engineering  methods  in  order  to  optimize  performance  and  obtain  faster 
and  more  efficient  parallel  algorithms  and  implementations. 

To  design  efficient  parallel  algorithms,  it  is  necessary  not  only  to  load  balance  the  compu¬ 
tation,  but  also  to  minimize  the  time  spent  communicating  between  processors.  The  inter- 
processor  communication  costs  are  in  many  cases  significantly  higher  than  the  computational 
costs.  Moreover,  hardware  trends  predict  that  more  problems  will  become  communication- 
bound  in  the  future  [38,  35].  Even  matrix  multiplication,  which  is  widely  considered  to 
be  computation-bound,  becomes  communication-bound  when  a  given  problem  is  run  on 
sufficiently  many  processors. 

Here  we  consider  three  cases  of  matrix  multiplication:  fast  matrix  multiplication  algorithms, 
such  as  Strassen’s,  which  compute  the  product  of  dense  matrices  using  asymptotically  fewer 
than  the  naive  number  of  scalar  products  and  so  run  in  (o(n3))  time  (Chapter  2);  classical 
matrix  multiplication,  for  which  all  of  the  naive  scalar  products  A ■  B^j  must  be  computed 
(Chapter  3);  and  sparse  matrix  multiplication,  where  most  of  the  entries  of  the  input  matrices 
are  zero,  and  only  nonzero  products  are  computed  (Chapter  4).  In  each  case,  we  present- 
a  new  recursive  parallel  algorithm.  The  new  algorithms  are  communication-optimal:  they 
asymptotically  match  communication  lower  bounds,  and  they  communicate  asymptotically 
less  than  previous  algorithms.  In  the  cases  of  sparse  matrix  multiplication  and  high  aspect 
ratio  rectangular  matrices  for  classical  matrix  multiplication  we  present  new,  tight  lower 
bounds.  We  also  present  benchmarking  data  that  shows  our  new  algorithms  are  faster  than 
previous  algorithms  in  practice.  Compared  to  the  best  previous  algorithm,  we  show  speedups 
of  up  to  2.8 x  for  Strassen’s  algorithm,  140 x  for  classical  matrix  multiplication,  and  8x  for 
sparse  matrix  multiplication.  In  Chapter  5,  we  explain  how  to  generalize  our  parallelization 
approach  to  other  recursive  algorithms,  including  those  with  more  dependencies  than  matrix 
multiplication. 
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1.1  Communication  Model 

We  model  communication  of  distributed-memory  parallel  architectures  as  follows.  We  assume 
the  machine  has  P  processors,  each  with  local  memory  of  size  M  words,  which  are  connected 
via  a  network.  Processors  communicate  via  messages,  and  we  assume  that  a  message  of  w 
words  can  be  communicated  in  time  a  +  (3w.  Here  a  and  (3  are  machine  parameters  specifying 
the  overhead  time  per  message  and  the  reciprocal  bandwidth,  respectively.  The  bandwidth 
cost  of  the  algorithm  is  given  by  the  word  count  and  denoted  by  W,  and  the  latency  cost 
is  given  by  the  message  count  and  denoted  by  S.  Similarly  the  computational  cost  is  given 
by  the  number  of  floating  point  operations  and  denoted  by  F.  We  call  the  time  per  floating 
point  operation  7. 

We  count  the  number  of  words,  messages  and  floating  point  operations  along  the  critical 
path  as  defined  in  [66].  That  is,  two  messages  that  are  communicated  between  separate  pairs 
of  processors  simultaneously  are  counted  only  once,  as  are  two  floating  point  operations 
performed  in  parallel  on  different  processors.  Note  that  we  do  not  require  the  communication 
to  be  bulk-synchronous:  some  of  the  processors  may  be  engaged  in  communication  while 
others  are  performing  computations.  This  metric  is  closely  related  to  the  total  running  time 
of  the  algorithm,  which  we  model  as 


aS  +  PW  +  7  F. 

We  assume  that  (1)  the  architecture  is  homogeneous  (that  is,  7  is  the  same  on  all  processors 
and  a  and  (3  are  the  same  between  each  pair  of  processors),  (2)  processors  can  send/receive 
only  one  message  to/from  one  processor  at  a  time  and  they  cannot  overlap  computation  with 
communication  (this  latter  assumption  can  be  dropped,  affecting  the  running  time  by  a  factor 
of  at  most  two),  and  (3)  there  is  no  communication  resource  contention  among  processors. 
That  is,  we  assume  that  there  is  a  link  in  the  network  between  each  pair  of  processors.  Thus 
lower  bounds  derived  in  this  model  are  valid  for  any  network,  but  attainability  of  the  lower 
bounds  depends  on  the  details  of  the  network.  One  way  to  model  more  realistic  networks  is 
as  a  hierarchy  with  different  values  for  a  and  /3  at  each  level. 

Perfect  Strong  Scaling 

We  say  that  an  algorithm  exhibits  perfect  strong  scaling  if  its  running  time  for  a  fixed  problem 
size  decreases  linearly  with  the  number  of  processors;  that  is,  if  all  three  of  F .  W,  and  S 
decrease  linearly  with  P.  Several  of  the  algorithms  we  will  discuss  exhibit  perfect  strong 
scaling  within  certain  ranges.  Our  running  time  model  naturally  generalizes  to  give  an  energy 
model,  and  perfect  strong  scaling  of  runtime  corresponds  to  getting  linear  speedup  in  P  while 
using  no  extra  energy  [32], 
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1.2  Iterative  Parallel  Matrix  Multiplication 
Algorithms 

The  first  scalable  algorithm  for  parallel  matrix  multiplication  is  due  to  Cannon  in  1969  [21]. 
Cannon’s  algorithm  multiplies  two  n  x  n  matrices  on  P  processors  on  a  square  grid  with 
bandwidth  cost  O  [n2 /\/P^j.  SUMMA  generalizes  this  to  arbitrary  matrix  dimensions  and  a 
rectangular  grid  of  processors  [36]  using  broadcasts  rather  than  cyclic  shifts.  We  call  these 
“2D”  algorithms,  because  they  use  a  two-dimensional  processor  grid  and  have  bandwidth  cost 
that  scales  as  1/  \[P . 

Another  class  of  algorithms,  known  as  “3D"  [12,  2]  because  the  communication  pattern 
maps  to  a  three-dimensional  processor  grid,  uses  more  local  memory  and  reduces  communi¬ 
cation  relative  to  2D  algorithms.  The  bandwidth  cost  of  these  algorithms  scales  as  1/P2/3. 
Unfortunately,  they  can  only  be  used  if  a  factor  of  ^(P1/3)  extra  memory  is  available. 

The  “2.5D”  algorithm  [52,  61]  interpolates  between  2D  and  3D  algorithms,  using  as  much 
memory  as  is  available  to  reduce  communication.  For  square  matrices,  it  asymptotically 
matches  the  lower  bounds  of  [45]  and  [5],  and  so  is  communication-optimal.  The  recently 
proposed  3D-SUMMA  algorithm  [57]  attempts  to  generalize  the  2.5D  algorithm  to  rectan¬ 
gular  matrices.  As  our  communication  lower  bounds  in  Section  3.1  show,  3D-SUMMA  is 
communication-optimal  for  many,  but  not  all,  matrix  dimensions. 

1.3  Parallelizing  Recursive  Algorithms 

An  alternative  to  the  iterative  parallelization  approach  is  what  we  call  the  BFS/DFS  approach. 
BFS/DFS  algorithms  are  based  on  sequential  recursive  algorithms,  and  view  the  processor 
layout  as  a  hierarchy  rather  than  a  grid.  Breadth-first  steps  (BFS)  and  depth-first  steps  (DFS) 
are  alternate  ways  to  solve  the  subproblems.  At  a  BFS  step,  all  of  the  subproblems  are  solved 
in  parallel  on  independent  subsets  of  the  processors,  whereas  at  a  DFS  all  the  processors 
work  together  to  solve  the  subproblems  serially.  In  general,  BFS  steps  reduce  communication 
costs,  but  may  require  extra  memory  relative  to  DFS  steps.  The  extra  memory  at  a  BFS  step 
is  because  each  subset  of  the  processors  must  have  space  for  the  entire  input  and  output  of 
its  subproblem.  With  correct  interleaving  of  BFS  and  DFS  steps  to  stay  within  the  available 
memory,  we  show  that  BFS/DFS  gives  communication-optimal  algorithms  for  all  the  cases  of 
matrix  multiplication  we  consider.  Because  of  their  recursive  structure,  BFS/DFS  algorithms 
are  cache-,  processor-,  and  network-oblivious  in  the  sense  of  [34,  13,  25,  26].  Note  that  they  are 
not  oblivious  to  the  aggregated  local  memory  (DRAM)  size,  which  is  necessary  to  determine 
the  optimal  interleaving  of  BFS  and  DFS  steps.  Hence  they  should  perform  well  without 
much  tuning  on  hierarchical  architectures,  which  are  becoming  more  common. 

Our  primary  focus  is  on  matrix  multiplication  algorithms.  In  this  case,  subproblems  are 
independent  of  each  other,  and  hence  may  be  computed  simultaneously.  In  Chapter  5,  we  relax 
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this  requirement,  and  show  how  to  parallelize  some  recursive  algorithms  with  dependencies 
between  the  subproblems. 

1.4  The  Loomis- Whitney  Inequality 

The  lower  bound  proofs  in  Sections  3.1  and  4.2  make  use  of  the  geometric  embedding 
techniques  of  [45,  10],  so  we  review  them  here.  This  section  may  be  safely  skipped  if  the 
reader  is  only  interested  in  parallelizing  Strassen’s  fast  matrix  multiplication  algorithm  as 
described  in  Chapter  2. 

Let  A  be  an  m  x  k  matrix,  B  be  a  k  x  n  matrix,  and  C  be  an  m  x  n  matrix.  The  rank 
scalar  multiplications  that  the  classical  algorithm  performs  when  computing  C  =  A  ■  B  may 
be  arranged  into  a  rectangular  prism  V  of  size  m  x  n  x  k  with  the  three  matrices  as  its  faces. 
To  perform  a  given  multiplication,  a  processor  must  have  access  to  the  entries  of  A,  B ,  and 
C  corresponding  to  the  projections  onto  the  m  x  k,  n  x  k,  and  m  x  n  faces  of  the  prism, 
respectively.  If  these  entries  are  not  assigned  to  that  processor  by  the  initial  or  final  data 
layout,  these  entries  correspond  to  words  that  must  be  communicated. 

Given  a  set  of  voxels  V  C  V,  the  projections  of  the  set  onto  three  orthogonal  faces 
corresponds  to  the  input  elements  of  A  and  B  necessary  to  perform  the  multiplications  and 
the  output  elements  of  C  that  the  products  must  update.  The  computation  cube  and  this 
relationship  of  voxels  to  input  and  output  matrix  elements  is  shown  in  Figure  1.1.  The 
following  lemma  relates  the  volume  of  V  to  its  projections: 

Lemma  1.1.  [49]  Let  V  be  a  finite  set  of  lattice  points  in  R3,  i.e.,  points  ( x,y,z )  with 
integer  coordinates.  Let  Vx  be  the  projection  of  V  in  the  x -direction,  i.e.,  all  points  ( y,z ) 
such  that  there  exists  an  x  so  that  ( x,y,z )  G  V.  Define  Vy  and  Vz  similarly.  Let  |  •  |  denote 
the  cardinality  of  a  set.  Then  \V\  <  -\/|14|  •  \Vy\  ■  \V~\. 


Figure  1.1:  The  computation  rectangular  prism  for  matrix  multiplication,  with  a  specified  sub¬ 
set  of  voxels  V  along  with  its  three  projections.  Each  voxel  corresponds  to  the  multiplication 
of  its  projection  onto  A  and  B ,  and  contributes  to  its  projection  onto  C. 
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Chapter  2 

Strassen’s  Matrix  Multiplication 


Strassen  showed  that  2x2  matrix  multiplication  can  be  performed  using  7  multiplications 
and  18  additions,  instead  of  the  classical  algorithm  that  does  8  multiplications  and  4  additions 
[62],  By  recursive  application  this  yields  an  algorithm  which  multiplies  two  n  x  n  matrices 
with  O^nS0)  flops,  where  cuo  =  log2  7  ~  2.81.  Winograd  improved  the  algorithm  to  use  7 
multiplications  and  15  additions  in  the  base  case,  thus  decreasing  the  hidden  constant  in  the 
O  notation  [65].  Further  reduction  in  the  number  of  additions  is  not  possible  [56,  16]. 

In  this  chapter,  we  show  how  to  parallelize  Strassen’s  algorithm  in  a  communication- 
optimal  way,  and  benchmark  and  analyze  its  performance  on  several  machines.  The  results  in 
this  chapter  are  joint  work  with  Grey  Ballard,  James  Demmel,  Olga  Holtz,  and  Oded  Schwartz. 
The  algorithm,  analysis,  and  preliminary  performance  appear  in  [6];  more  performance  results, 
implementation  details,  and  performance  modeling  appear  in  [48]. 

We  use  the  term  parallel  Strassen  algorithm  for  a  parallel  algorithm  that  performs  exactly 
the  same  arithmetic  operations  as  any  variant  of  Strassen’s  (sequential)  algorithm;  that  is, 
any  algorithm  based  on  2  x  2  matrix  multiplication  using  7  scalar  multiplications.  We  use  the 
broader  term  parallel  Strassen-based  algorithm  for  a  parallel  matrix  multiplication  algorithm 
that  is  a  hybrid  of  any  variant  of  Strassen’s  and  the  classical  algorithm.  Examples  of  such 
hybrids  are  given  in  the  next  section.  Note  that  Theorems  2.1  and  2.2  below  apply  to  parallel 
Strassen  algorithms,  but  not  to  all  Strassen-based  algorithms. 

The  rest  of  this  chapter  is  organized  as  follows.  We  review  the  Strassen-Winograd  algorithm 
in  Section  2.1,  and  the  communication-cost  lower  bounds  for  Strassen-like  algorithms  in 
Section  2.2.  Section  2.3  presents  our  new  communication-optimal  algorithm.  In  Section  2.4 
we  analyze  the  communication  costs  of  other  approaches  to  parallelizing  Strassen’s  algorithm, 
and  show  that  none  of  them  are  communication-optimal.  We  present  performance  results  for 
both  classical  and  Strassen-based  algorithms  in  Section  2.5  and  compare  the  performance  to 
a  theoretical  model  in  Section  2.6.  We  discuss  details  of  our  implementation,  and  numerical 
stability  of  Strassen’s  algorithm  in  Sections  2.7  and  2.8,  respectively.  Finally,  in  Section  2.9 
we  show  how  to  generalize  our  approach  to  other  fast  matrix  multiplication  algorithms. 


CHAPTER  2.  STRASSEN’S  MATRIX  MULTIPLICATION 


6 


2.1  Strassen-Winograd  Algorithm 


To  perform  matrix  multiplication  using  the  Strassen-Winograd  algorithm,  first  divide  the 
input  matrices  A,  B  and  output  matrix  C  into  4  submatrices: 


A 


An  A12 

A'2 1  A2  2 


Bn  B\2 
B2\  B  22 


C 


C\  i  c12 
C'21  C'22 


Then  compute  7  linear  combinations  of  the  submatrices  of  each  of  A  and  B ,  call  these  7) 
and  Si,  respectively;  multiply  them  pairwise;  then  compute  the  submatrices  of  C  as  linear 
combinations  of  these  products: 


T\ 

T2 

t3 

t4 

t5 

n 

t7 


—  An 

=  A12 

—  A21  +  a22 

=  T.i  —  An 
=  An  —  A21 
=  A 12  —  T4 

—  A22 


51  =  B 

52 

53 
Sa 
S> 5 
Sq 

s7 
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=  B  21 

=  B12  —  Bn 
=  B22  —  S3 
=  B  22  —  B 12 
=  B22 
=  S4-  B2i 

This  is  one  step  of  Strassen-Winograd.  The  algorithm  is  recursive  since  it  can  be  used 
for  each  of  the  7  smaller  matrix  multiplications.  In  practice,  one  often  uses  only  a  few 
steps  of  Strassen-Winograd,  although  to  attain  0(na’°)  computational  cost,  it  is  necessary 
to  recursively  apply  it  all  the  way  down  to  matrices  of  size  0(1)  x  0(1).  The  precise 
computational  cost  of  Strassen-Winograd  is 


Qi  =  Ti  ■  Si 

Q2  —  T2  ■  S2 

Q:i  —  T3  ■  S3 
Qa  =  T4  ■  S4 
Qe>  =  t5  •  s5 
Qe  =  n-  S6 
Q7  =  T7  ■  S7 


Ui 

U2 

U3 

On 

O12 

O21 

O22 


—  Q 1  +  Q4 

—  Ui  +  Q5 
=  Ui  +  Q3 

=  Qi  +  Q2 

—  U3  +  Qq 

=  u2-q7 

=  U2  +  Q3 


F(n)  =  csn‘ 


U)0 


5  n2 


(2.1) 


Here  cs  is  a  constant  depending  on  the  cutoff  point  at  which  one  switches  to  the  classical 
algorithm.  For  a  cutoff  size  of  no,  the  constant  is  cs  =  (2no  +  4)/nQ0-2  which  is  minimized  at 
no  =  8  yielding  a  computational  cost  of  approximately  3.73 nP0  —  5 n2. 


2.2  Communication  Lower  Bounds 


For  parallel  Strassen  algorithms,  the  bandwidth  cost  lower  bound  has  been  proved  using 
expansion  arguments  on  the  computation  graph  [9],  and  the  latency  cost  lower  bound  is  an 
immediate  corollary.  We  believe  the  requirement  of  no  recomputation  is  purely  technical,  but 
there  is  no  known  proof  of  the  lower  bounds  without  it. 


Theorem  2.1.  (Memory-dependent  lower  bound)  [9]  Consider  a  parallel  Strassen  algorithm 
running  on  P  processors  each  with  local  memory  size  M.  Let  W(n,  P,  M)  be  the  bandwidth 
cost  and  S(n,P,M )  be  the  latency  cost  of  the  algorithm.  Assume  that  no  intermediate  values 
are  computed  twice.  Then 


5 


W(n,  P,  M)  =  Q 
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S(n,  P,  M) 


We  extended  this  to  a  memory-independent  lower  bound  using  the  same  expansion 
approach: 


Theorem  2.2.  (Memory-independent  lower  bound)  [5]  Consider  a  parallel  Strassen  algorithm 
running  on  P  processors.  Let  W(n,P)  be  the  bandwidth  cost  and  S(n,P )  be  the  latency  cost 
of  the  algorithm.  Assume  that  no  intermediate  values  are  computed  twice.  Assume  only 
one  copy  of  the  input  data  is  stored  at  the  start  of  the  algorithm  and  the  computation  is 
load-balanced  in  an  asymptotic  sense.  Then 


W(n,P)  =  n  (Py , 

S(n,  P)  =  O  (1) . 


Note  that  when  M  =  0{n2 / P2/U°f  the  memory-dependent  lower  bound  dominates,  and 
when  M  =  Ct(n2 / P2P°f  the  memory-independent  lower  bound  dominates. 


2.3  Communication- Avoiding  Parallel  Strassen 


In  this  section  we  present  the  CAPS  algorithm,  and  prove  it  is  communication-optimal.  See 
Algorithm  1  for  a  concise  presentation  and  Algorithm  2  for  a  more  detailed  description. 

Theorem  2.3.  CAPS  has  computational  cost 


0 


5 


bandwidth  cost 

(  i  TP0 

\  \  PM^o/2-i 

and  latency  cost 

e(max{pik75logP'logi5})  ■ 

By  Theorems  2.1  and  2.2,  we  see  that  CAPS  has  optimal  computational  and  bandwidth 
costs,  and  that  its  latency  cost  is  at  most  logP  away  from  optimal.  We  prove  Theorem  2.3 
in  Section  2.3.5. 


CHAPTER  2.  STRASSEN’S  MATRIX  MULTIPLICATION 


BFS 

A-B 


DFS 

A-B 


Figure  2.1:  Representation  of  BFS  and  DFS  steps.  In  a  BFS  step,  all  seven  subproblems  are 
computed  at  once,  each  on  1/7  of  the  processors.  In  a  DFS  step,  the  seven  subproblems  are 
computed  in  sequence,  each  using  all  the  processors.  The  notation  follows  that  of  Section  2.1. 


2.3.1  Overview  of  CAPS 

Consider  the  recursion  tree  of  Strassen’s  sequential  algorithm.  CAPS  traverses  it  in  parallel 
as  follows.  At  each  level  of  the  tree,  the  algorithm  proceeds  in  one  of  two  ways.  A  “breadth- 
first-step”  (BFS)  divides  the  7  subproblems  among  the  processors,  so  that  l  of  the  processors 
work  on  each  subproblem  independently  and  in  parallel.  A  “depth- first-step”  (DFS)  uses  all 
the  processors  on  each  subproblem,  solving  each  one  in  sequence.  See  Figure  2.1. 

In  short,  a  BFS  step  requires  more  memory  but  reduces  communication  costs  while  a  DFS 
step  requires  little  extra  memory  but  is  less  communication-efficient.  In  order  to  minimize 
communication  costs,  the  algorithm  must  choose  an  ordering  of  BFS  and  DFS  steps  that 
uses  as  much  memory  as  possible. 

Let  k  =  log7  P  and  s  >  k  be  the  number  of  distributed  Strassen  steps  the  algorithm 
will  take.  For  simplicity,  in  this  section,  we  assume  that  n  is  a  multiple  of  2s7^k^2\  If  k  is 
even,  the  restriction  simplifies  to  n  being  a  multiple  of  2 s\fP.  Since  P  is  a  power  of  7,  it  is 
sometimes  convenient  to  think  of  the  processors  as  numbered  in  base  7.  CAPS  performs  s 
steps  of  Strassen’s  algorithm  and  finishes  the  calculation  with  local  matrix  multiplication. 
The  algorithm  can  easily  be  generalized  to  other  values  of  n  by  padding  or  dynamic  peeling 
(where,  at  each  recursive  step,  if  the  matrix  dimension  is  odd  the  last  row  and  column  are 
handled  separately). 

We  consider  two  simple  schemes  of  traversing  the  recursion  tree  with  BFS  and  DFS  steps. 
The  first  scheme,  which  we  call  the  Unlimited  Memory  (UM)  scheme,  is  to  take  k  BFS  steps 
in  a  row.  This  approach  is  possible  only  if  there  is  sufficient  available  memory.  It  is  also  used 
in  the  second  scheme,  which  we  call  the  Limited  Memory  (LM)  scheme,  which  is  to  take  I 
DFS  steps  in  a  row  followed  by  k  BFS  steps  in  a  row,  where  I  is  minimized  subject  to  the 
memory  constraints. 

It  is  possible  to  use  a  more  complicated  scheme  that  interleaves  BFS  and  DFS  steps  to 
reduce  communication.  We  show  that  the  LM  scheme  is  optimal  up  to  a  constant  factor,  and 
hence  no  more  than  a  constant  factor  improvement  can  be  attained  from  interleaving. 
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Algorithm  1  CAPS,  in  brief.  For  more  details,  see  Algorithm  2. 

Input:  A,  B,  n,  where  A  and  B  are  n  x  n  matrices 
P  =  number  of  processors 

Output:  C  =  A  ■  B 

>  The  dependence  of  the  Si  s  on  A,  the  TVs  on  B  and  C  on  the  Q/  s  follows  the  Strassen 
or  Strassen- Winograd  algorithm.  See  Section  2.1. 

1:  procedure  C  =  CAPS(A,  B,  n,  P ) 

2:  if  enough  memory  then  >  Do  a  BFS  step 

3:  locally  compute  the  Si  s  and  Tj’s  from  A  and  B 

4:  parallel  for  i  —  1 ...  7  do 

5:  redistribute  St  and  T% 

6:  Qi  =  CAPS  (Si,  Ti,  nf  2,  P/7 ) 

7:  redistribute  Qi 

8:  locally  compute  C  from  all  the  Q/s 

9:  else  >  Do  a  DFS  step 

10:  for  i  —  1 ...  7  do 

11:  locally  compute  Si  and  Tt  from  A  and  B 

12:  Qi  =  CAPS  (Si,  Ti ,  n/ 2,  P ) 

13:  locally  compute  contribution  of  Qi  to  C 


2.3.2  Data  layout 

We  require  that  the  data  layout  of  the  matrices  satisfies  the  following  two  properties: 

1.  At  each  of  the  s  Strassen  recursion  steps,  the  data  layouts  of  the  four  sub-matrices  of 
each  of  A,  B ,  and  C  must  match  so  that  the  weighted  additions  of  these  sub-matrices 
can  be  performed  locally.  This  technique  follows  [51]  and  allows  communication- free 
DFS  steps. 

2.  Each  of  these  submatrices  must  be  equally  distributed  among  the  P  processors  for  load 
balancing. 

There  are  many  data  layouts  that  satisfy  these  properties,  perhaps  the  simplest  being  block- 
cyclic  layout  with  a  processor  grid  of  size  7^k^2^  x  and  block  size  2s7/k/2\  x  2s7ffc/2i  •  (When 
k  =  log7  P  is  even  these  expressions  simplify  to  a  processor  grid  of  size  \/~P  x  y/P  and  block 
size  See  Figure  2.2. 

Any  layout  that  we  use  is  specified  by  three  parameters,  (n,  P,  s ),  and  intermediate  stages 
of  the  computation  use  the  same  layout  with  smaller  values  of  the  parameters.  A  BFS  step 
reduces  a  multiplication  problem  with  layout  parameters  (n,  P,  s)  to  seven  subproblems  with 
layout  parameters  (n/2,  P/7,  s  —  1).  A  DFS  step  reduces  a  multiplication  problem  with  layout 
parameters  ( n,P,s )  to  seven  subproblems  with  layout  parameters  (n/2,P,  s  —  1). 
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Note  that  if  the  input  data  is  initially  load-balanced  but  distributed  using  a  different 
layout,  we  can  rearrange  it  to  the  above  layout  using  no  more  than  O  words  and  0(P) 
messages.  This  has  no  asymptotic  effect  on  the  bandwidth  cost  but  may  significantly  increase 
the  latency  cost. 
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Figure  2.2:  An  example  matrix  layout  for  CAPS.  Each  of  the  16  submatrices  as  shown  on 
the  left  has  exactly  the  same  layout.  The  colored  blocks  are  the  ones  owned  by  processor  00. 
On  the  right  is  a  zoomed-in  view  of  one  submatrix,  showing  which  processor,  numbered  base 
7,  owns  each  block.  This  is  block-cyclic  layout  with  some  blocksize  b,  and  matches  our  layout 
requirements  with  parameters  (n  —  4  •  7  •  b,  P  —  49,  s  =  2). 


2.3.3  Unlimited  Memory  (UM)  scheme 

In  the  UM  scheme,  we  take  k  =  log7  P  BFS  steps  in  a  row.  Since  a  BFS  step  reduces  the 
number  of  processors  involved  in  each  subproblem  by  a  factor  of  7,  after  k  BFS  steps  each 
subproblem  is  assigned  to  a  single  processor,  and  so  is  computed  locally  with  no  further 
communication  costs.  We  first  describe  a  BFS  step  in  more  detail. 

The  matrices  A  and  B  are  initially  distributed  as  described  in  Section  2.3.2.  In  order  to 
take  a  recursive  step,  the  14  matrices  Si, . . .  S7,  Tf, . . . ,  X7  must  be  computed.  Each  processor 
allocates  space  for  all  14  matrices  and  performs  local  additions  and  subtractions  to  compute 
its  portion  of  the  matrices.  Recall  that  the  submatrices  are  distributed  identically,  so  this 
step  requires  no  communication.  If  the  layouts  of  A  and  B  have  parameters  (n,P,s),  the  Si 
and  the  T%  now  have  layout  parameters  (n/2,  P,  s  —  1). 

The  next  step  is  to  redistribute  these  14  matrices  so  that  the  7  pairs  of  matrices  (Si,  T) 
exist  on  disjoint  sets  of  P/7  processors.  This  requires  disjoint  sets  of  7  processors  performing 
an  all-to-all  communication  step  (each  processor  must  send  and  receive  a  message  from  each 
of  the  other  6).  To  see  this,  consider  the  numbering  of  the  processors  base-7.  On  the  mth 
BFS  step,  the  communication  is  between  the  seven  processors  whose  numbers  agree  on  all 
digits  except  the  mth  (counting  from  the  right).  After  the  mth  BFS  step,  the  set  of  processors 
working  on  a  given  subproblem  share  the  same  m-digit  suffix.  After  the  above  communication 
is  performed,  the  layout  of  Si  and  Tt  has  parameters  (n/2,  P/7,  s  —  1),  and  the  sets  of 
processors  that  own  the  T%  and  Si  are  disjoint  for  different  values  of  i.  Since  each  all-to-all 
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Algorithm  2  CAPS,  in  detail 
Input:  A,  B,  are  n  x  n  matrices 
P  =  number  of  processors 
rank  =  processor  number  base-7  as  an  array 
M  =  local  memory  size 
Output:  C  =  A  ■  B 

1:  procedure  C  =  CAPS(A,  B,  P,  rank,  M) 

2:  I  =  |~log2  pi/4^i/2~l  >  7  is  number  of  DFS  steps  to  fit  in  memory 

3:  k  =  log7  P 

4:  call  DFS  (A,  5,  C,  k,  £,  rank) 


1:  procedure  DFS  (A,  B,  C ,  k,  7,  rank)  >  Do  C  =  A  ■  B  by  I  DFS,  then  k  BFS  steps 
2:  if  I  <  0  then  call  BFS(  A,  B,  C ,  k,  rank);  return 

3:  for  i  —  1 ...  7  do 

4:  locally  compute  S,  and  Tj  from  A  and  B  >  following  Strassen’s  algorithm 

5:  call  DFS(  St,  Ti ,  Qh  k,  £  -  1,  rank  ) 

6:  locally  compute  contribution  of  Qt  to  C  >  following  Strassen’s  algorithm 


1:  procedure  BFS  (A,  B,  C,  k,  rank) 

>  Do  C  =  A  ■  B  by  k  BFS  steps,  then  local  Strassen 
2:  if  k  ==  0  then  call  localStrassen(A,  B ,  C):  return 

3:  for  i  —  1 ...  7  do 

4:  locally  compute  Si  and  7)  from  A  and  B  >  following  Strassen’s  algorithm 

5:  for  i  —  1 ...  7  do 

6:  target  =  rank 

7:  target  [k]  =  i 

8:  send  S)  to  target 

9:  receive  into  L  >  One  part  of 

10:  send  T)  to  target 

11:  receive  into  R  >  One  part  of 

12:  call  BFS(L,  R,  V,  k  —  1,  rank  ) 

13:  for  i  —  1 ...  7  do 

14:  target  =  rank 

15:  target  [fc]  =  i 

16:  send  ith  part  of  V  to  target 

17:  receive  from  target  into  Qj 

18:  locally  compute  C  from  Qt  >  following  Strassen’s  algorithm 


L  comes  from  each  of  7  processors 
R  comes  from  each  of  7  processors 
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communication  only  involves  seven  processors  no  matter  how  large  P  is,  this  algorithm  does 
not  have  the  scalability  issues  that  typically  come  from  an  all-to-all  communication  pattern. 


Memory  requirements 

The  extra  memory  required  to  take  one  BFS  step  is  the  space  to  store  all  7  triples  Sj,  Tj, 
Qj.  Since  each  of  those  matrices  is  |  the  size  of  A,  B,  and  C,  the  extra  space  required  at  a 
given  step  is  7/4  the  extra  space  required  for  the  previous  step.  We  assume  that  no  extra 
memory  is  required  for  the  local  multiplications.1  Thus,  the  total  local  memory  requirement 
for  taking  k  BFS  steps  is  given  by 


MemUM(n,  P ) 


7  n2 

p2/u0 


4  n2 

P 


=  0 


\  p2/u0  I 


Computational  costs 


The  computation  required  at  a  given  BFS  step  is  that  of  the  local  additions  and  subtractions 
associated  with  computing  the  Si  and  Tj  and  updating  the  output  matrix  C  with  the  Qi. 
Since  Strassen-Winograd  performs  15  additions  and  subtractions,  the  computational  cost- 
recurrence  is 

Ujm(»,  P)  =  15  (44)  +  Fum  (4,  y) 

with  base  case  Tum('«-,  1)  =  csnuo  —  5 n2,  where  cs  is  the  constant  of  Strassen-Winograd.  See 
Section  2.1  for  more  details.  The  solution  to  this  recurrence  is 


-FuM(rq  P) 


csnu°  —  5 n2  „  f  rC°\ 


Communication  costs 

Consider  the  communication  costs  associated  with  the  UM  scheme.  Given  that  the  redis¬ 
tribution  within  a  BFS  step  is  performed  by  an  all-to-all  communication  step  among  sets 
of  7  processors,  each  processor  sends  6  messages  and  receives  6  messages  to  redistribute 
Si, . . . ,  SV,  and  tire  same  for  T1? . . . ,  T7.  After  the  products  Qi  =  S1Tl  are  computed,  each 
processor  sends  6  messages  and  receives  6  messages  to  redistribute  Qi, ,  Q 7.  The  size  of 

each  message  varies  according  to  the  recursion  depth,  and  is  the  number  of  words  a  processor 

2 

owns  of  any  Sj,  Tj,  or  Qi,  namely  fp  words. 

1If  one  does  not  overwrite  the  input,  it  is  impossible  to  run  Strassen  in  place;  however  using  a  few 
temporary  matrices  affects  the  analysis  here  by  a  constant  factor  only. 
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As  each  of  the  Qi  is  computed  simultaneously  on  disjoint  sets  of  P/7  processors,  we  obtain 
a  cost  recurrence  for  the  entire  UM  scheme: 


T)  ^  /  Ti  P 

WW(n,P)  =  36—  +  WVM(-,- 


AP 

Sum  ('«,  P)  =  36  +  Sum 

with  base  case  Sum (n,  1)  =  kUuMCffi  1)  =  0.  Thus 


n  P 
2’  7 


W\jm  (  n,  P )  = 


12n2 


12n2 


=  0 


p2/u0  p 

Sum  P)  =  36  log7  P  =  0  (log  P ) . 


(— j 

1  p2/uj0  I 


(2.2) 


2.3.4  Limited  Memory  (LM)  scheme 

In  this  section  we  discuss  a  scheme  for  traversing  Strassen’s  recursion  tree  in  the  context  of 
limited  memory.  In  the  LM  scheme,  we  take  I  DFS  steps  in  a  row  followed  by  k  BFS  steps  in 
a  row,  where  I  is  minimized  subject  to  the  memory  constraints.  That  is,  we  use  a  sequence  of 
DFS  steps  to  reduce  the  problem  size  so  that  we  can  use  the  UM  scheme  on  each  subproblem 
without  exceeding  the  available  memory. 

Consider  taking  a  single  DFS  step.  Rather  than  allocating  space  for  and  computing  all  14 
matrices  ,Sj  ,  7) , . . . ,  St,  TV  at  once,  the  DFS  step  requires  allocation  of  only  one  subproblem, 
and  each  of  the  Qi  will  be  computed  in  sequence. 

Consider  the  ith  subproblem:  as  before,  both  S)  and  T)  can  be  computed  locally.  After 
Qi  is  computed,  it  is  used  to  update  the  corresponding  quadrants  of  C  and  then  discarded 
so  that  its  space  in  memory  (as  well  as  the  space  for  Si  and  Tj)  can  be  re-used  for  the  next 
subproblem.  In  a  DFS  step,  no  redistribution  occurs.  After  Si  and  T%  are  computed,  all 
processors  participate  in  the  computation  of  Qi. 

We  assume  that  some  extra  memory  is  available.  To  be  precise,  assume  the  matrices  A, 
B,  and  C  require  only  |  of  the  available  memory: 


3n2  1  , r 

- <  -M. 

P  ~  3 


(2.3) 


In  the  LM  scheme,  we  set 


I  =  max 


4n 


pi  M  m 1/2 


(2.4) 


The  following  subsection  shows  that  this  choice  of  I  is  sufficient  not  to  exceed  the  memory 
capacity. 
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Memory  requirements 

The  extra  memory  requirement  for  a  DFS  step  is  the  space  to  store  one  subproblem.  Thus, 
the  extra  space  required  at  this  step  is  1/4  the  space  required  to  store  A,  B,  and  C.  The 
local  memory  requirements  for  the  LM  scheme  is  given  by 


MemLM(n,  P ) 


+  MemUM 


< 


< 


M 


i- i 

E 

i= 0 


1' 

4  1  + 


127 

- M  <  M. 

144 


p2/uo0 


where  the  last  line  follows  from  (2.4)  and  (2.3).  Thus,  the  limited  memory  scheme  does  not 
exceed  the  available  memory. 


Computational  costs 

As  in  the  UM  case,  the  computation  required  at  a  given  DFS  step  is  that  of  the  local 
additions  and  subtractions  associated  with  computing  each  St  and  Tt  and  updating  the 
output  matrix  C  with  the  Qi.  However,  since  all  processors  participate  in  each  subproblem 
and  the  subproblems  are  computed  in  sequence,  the  recurrence  is  given  by 

eLM(n,  P)  =  15  (A)  +  7  ■  Flm  (|,  p)  . 


After  I  steps  of  DFS,  the  size  of  a  subproblems  is  |jr  x  and  there  are  P  processors  involved. 
We  take  k  BFS  steps  to  compute  each  of  these  7l  subproblems.  Thus 

Flm(%,p)=Fvk(%,p), 

and 


Flm  (n,  P ) 


i=0 


+  7l  ■  F\jm 


csrC°  —  5  n2 

P 


Communication  costs 

Since  there  are  no  communication  costs  associated  with  a  DFS  step,  the  recurrence  is  simply 

WLM(n,P)  =  7-WLM^,P) 

SjM(n,  P)  =  7  ■  Slm  ^ ,  -P) 


CHAPTER  2.  STRASSEN’S  MATRIX  MULTIPLICATION 


15 


with  base  cases 

wlm(^,p)  =  wvm(^p) 

Slu(y,,p)=Sum(^.p) 

Thus  the  total  communication  costs  are  given  by 


WLM(n,P)  =  f-WlJM  g,p)  < 
Slm  (n,  P)  =  7e  ■  S\jM  (^, 


^,P  < 


12 . 4aJ°~2rU° 

PM^op-i 

(4n)w° 


=  0 


n 


UJQ 


PM“°/2 


36  log7  P  =  0 


(pM“°/2 

0  f- 


n 


-l 

wo 


V  PM“  °/2 


logP 


(2.5) 


2.3.5  Communication  optimality 

Proof,  (of  Theorem  2.3  on  page  7).  In  the  case  that  M  >  MemuM(w,  P)  =  the 

UM  scheme  is  possible.  Then  the  communication  costs  are  given  by  Equation  (2.2)  on 
page  13  which  matches  the  lower  bound  of  Theorem  2.2  on  page  7.  Thus  the  UM  scheme  is 
communication-optimal  (up  to  a  logarithmic  factor  in  the  latency  cost  and  assuming  that 
the  data  is  initially  distributed  as  described  in  Section  2.3.2).  For  smaller  values  of  M,  the 
LM  scheme  must  be  used.  Then  the  communication  costs  are  given  by  Equation  (2.5)  on 
page  15  and  match  the  lower  bound  of  Theorem  2.1  on  page  6,  so  the  LM  scheme  is  also 
communicat  ion- opt  imal.  □ 

We  note  that  for  the  LM  scheme,  since  both  the  computational  and  communication  costs 
are  proportional  to  ^  (up  to  a  logP  factor  on  S),  we  can  expect  perfect  strong  scaling :  given 
a  fixed  problem  size,  increasing  the  number  of  processors  by  some  factor  will  decrease  each 
cost  by  the  same  factor.  However,  this  strong  scaling  property  has  a  limited  range.  For  any 
fixed  M  and  n,  increasing  P  increases  the  global  memory  size  PM.  The  limit  of  perfect- 
strong  scaling  is  exactly  when  there  is  enough  memory  for  the  UM  scheme.  See  [5]  for  details. 


2.4  Analysis  of  Other  Algorithms 

In  this  section  we  detail  the  asymptotic  communication  costs  of  other  matrix  multiplica¬ 
tion  algorithms,  both  classical  and  Strassen-based.  These  communication  costs  and  the 
corresponding  lower  bounds  are  summarized  in  Table  2.1. 

Many  of  the  algorithms  described  in  this  section  are  hybrids  of  two  different  algorithms. 
We  use  the  convention  that  the  names  of  the  hybrid  algorithms  are  composed  of  the  names 
of  the  two  component  algorithms,  hyphenated.  The  first  name  describes  the  algorithm  used 
at  the  top  level,  on  the  largest  problems,  and  the  second  describes  the  algorithm  used  at  the 
base  level  on  smaller  problems. 
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Table  2.1:  Asymptotic  computational  and  communication  costs  of  Strassen-basend  and  classical  square  matrix  multipli¬ 
cation  algorithms  and  corresponding  lower  bounds.  Here  Uq  =  log2  7  ~  2.81  is  the  exponent  of  Strassen;  I  is  the  number 
of  Strassen  steps  taken.  The  CAPS  algorithm  attains  the  lower  bounds  of  Section  2.2,  and  thus  is  optimal.  All  of  the 
other  Strassen-based  algorithms  have  asymptotically  higher  communication  costs;  see  Section  2.4  for  details. 
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2.4.1  Classical  Algorithms 

Any  classical  algorithm  must-  communicate  asymptotically  more  than  an  optimal  parallel 
Strassen  algorithm  (see  the  lower  bounds  in  [45]).  To  compare  the  communication  cost  upper 
and  lower  bounds  of  classical  and  parallel  Strassen  algorithms,  it  is  necessary  to  consider 
three  cases  for  the  memory  size:  when  the  memory-dependent  bounds  dominate  for  both 
classical  and  Strassen,  when  the  memory-dependent  bound  dominates  for  classical  but  the 
memory-independent  bound  dominates  for  Strassen,  and  when  the  memory-independent, 
bounds  dominate  for  both  classical  and  Strassen.  See  Figure  2.3. 


Case  1  M  —  Ll(n2  /  P)  and  M  =  0(n2 / p2/^°).  The  first-  condition  is  necessary  for  there 
to  be  enough  memory  to  hold  the  input-  and  output-  matrices;  the  second  condition  puts 
both  classical  and  Strassen  algorithms  in  the  memory-dependent  case.  Then  the  ratio  of  the 
bandwidth  costs  is: 

n3  /  nu°  \  /  /n2\  (3-^o)/2 

pm v2  /  pm“ o/2-i )  =  "  U  mJ 

Using  the  two  bounds  that  define  this  case,  we  obtain  R  =  0(P^~^0^2)  and  R  =  ^(P3/^0-1). 


Case  2  M  —  Ut{n2 / P2/^0)  and  M  =  0(n2 / P2//3).  This  means  that  in  the  classical  case  the 
memory-dependent  lower  bound  dominates,  but  in  the  Strassen  case  the  memory-independent 
lower  bound  dominates.  Then  the  ratio  is: 


R 


n3  /  n2  \ 

PM V2  /  P2P °  J 


Using  the  two  bounds  that  define  this  case,  we  obtain  R  =  0(P 3^°  and  R  =  Q(P2/W°  2/3). 


Case  3  M  —  12 (P2/3).  This  means  that  both  the  classical  and  Strassen  lower  bounds  are 
dominated  by  the  memory-independent  cases.  Then  the  ratio  is: 

^  =  0  (/57i / pi:)  =  9  (P2/"°~2/3)  - 

Overall,  depending  on  the  ratio  of  the  problem  size  to  the  available  memory,  the  factor  by 
which  the  classical  bandwidth  costs  exceed  the  Strassen  bandwidth  costs  is  0(Pa),  where  a 
ranges  from  A  _  |  ~  0.046  t-o  ~  0.10.  The  same  sort,  of  analysis  is  used  throughout 
Section  2.4  to  compare  each  algorithm  with  the  parallel  Strassen  lower  bounds. 
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Figure  2.3:  Bandwidth  costs  and  strong  scaling  of  matrix  multiplication:  classical  vs.  Strassen- 
based.  Horizontal  lines  correspond  to  perfect  strong  scaling.  Pmin  is  the  minimum  number  of 
processors  required  to  store  the  input  and  output  matrices. 


2.4.2  2D-Strassen 

One  idea  to  construct  a  parallel  Strassen-based  algorithm  is  to  use  a  2D  classical  algorithm  for 
the  inter-processor  communication,  and  use  the  fast  matrix  multiplication  algorithm  locally 
[51].  We  call  such  an  algorithm  “2D-Strassen” .  It  is  straightforward  to  implement,  but  cannot 
attain  all  the  computational  speedup  from  Strassen  since  it  uses  a  classical  algorithm  for  part 
of  the  computation.  In  particular,  it  does  not  use  Strassen  for  the  largest  matrices,  when 
Strassen  would  provide  the  greatest  reduction  in  computation.  As  a  result,  the  computational 
cost  exceeds  Q(nUo/P)  by  a  factor  of  pU~u °)/2  ~  po.io  2D-Strassen  algorithm  has  the 
same  communication  cost  as  2D  algorithms,  and  hence  does  not  match  the  communication 
costs  of  CAPS.  In  comparing  the  2D-Strasscn  bandwidth  cost,  0(n2 / P1/2),  to  the  CAPS 
bandwidth  cost  in  Section  2.3,  note  that  for  the  problem  to  fit  in  memory  we  always  have 
M  =  0(n2/P).  The  bandwidth  cost  exceeds  that  of  CAPS  by  a  factor  of  Pa,  where  a  ranges 
from  (3  —  cjo)/2  ~  .10  to  2/luq  —  1/2  «  .21,  depending  on  the  relative  problem  size.  Similarly, 
the  latency  cost,  0(P1//2),  exceeds  that  of  CAPS  by  a  factor  of  Pa  where  a  ranges  from 
(3  —  ujq)/2  «  .10  to  1/2  =  .5. 
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2.4.3  Strassen-2D 

The  “Strassen-2D”  algorithm  applies  I  DFS  steps  of  Strassen’s  algorithm  at  the  top  level,  and 
performs  the  7e  smaller  matrix  multiplications  using  a  2D  algorithm.  By  choosing  certain  data 
layouts  as  in  Section  2.3.2,  it  is  possible  to  do  the  additions  and  subtractions  for  Strassen’s 
algorithm  without  any  communication  [51].  However,  Strassen-2D  is  also  unable  to  match  the 
communication  costs  of  CAPS.  Moreover,  the  speedup  of  Strassen-2D  in  computation  comes 
at  the  expense  of  extra  communication.  For  large  numbers  of  Strassen  steps  £,  Strassen-2D  can 
approach  the  computational  lower  bound  of  Strassen,  but  each  step  increases  the  bandwidth 
cost  by  a  factor  of  Lt  and  the  latency  cost  by  a  factor  of  7.  Thus  the  bandwidth  cost  of 

Strassen-2D  is  a  factor  of  (|)  higher  than  2D-Strassen,  which  is  already  higher  than  that  of 
CAPS.  The  latency  cost  is  even  worse:  Strassen-2D  is  a  factor  of  7f  higher  than  2D-Strassen. 

One  can  reduce  the  latency  cost  of  Strassen-2D  at  the  expense  of  a  larger  memory  footprint. 
Since  Strassen-2D  runs  a  2D  algorithm  7£  times  on  the  same  set  of  processors,  it  is  possible 
to  pack  together  messages  from  independent  matrix  multiplications.  In  the  best  case,  the 
latency  cost  is  reduced  to  the  cost  of  2D-Strassen,  which  is  still  above  that  of  CAPS,  at  the 
expense  of  using  a  factor  of  ( |)  more  memory. 

2.4.4  2.5D-Strassen 

A  natural  idea  is  to  replace  a  2D  classical  algorithm  in  2D-Strassen  with  the  superior  2.5D 
classical  algorithm  to  obtain  an  algorithm  we  call  2.5D-Strassen.  This  algorithm  uses  the 
2.5D  algorithm  for  the  inter-processor  communication,  and  then  uses  Strassen  for  the  local 
computation.  When  M  =  0(n2/P),  2.5D-Strassen  is  exactly  the  same  as  2D-Strassen,  but 
when  there  is  extra  memory  it  both  decreases  the  communication  cost  and  decreases  the 
computational  cost  since  the  local  matrix  multiplications  are  performed  (using  Strassen)  on 
larger  matrices.  To  be  precise,  the  computational  cost  exceeds  the  lower  bound  by  a  factor  of 
Pa  where  a  ranges  from  1  —  ^  ps  0.064  to  «  0.10  depending  on  the  relative  problem  size. 
The  bandwidth  cost  exceeds  the  bandwidth  cost  of  CAPS  by  a  factor  of  Pa  where  a  ranges 
from  A  _  |  ~  0.046  to  ~  0.10.  In  terms  of  latency,  the  cost  of  p^3/2  +  log  P  exceeds 
the  latency  cost  of  CAPS  by  a  factor  ranging  from  logP  to  pU~^0)/2  ^  po.io,  ciep end i ng  on 
the  relative  problem  size. 

2.4.5  Strassen-2.5D 

Similarly,  by  replacing  a  2D  algorithm  with  2.5D  in  Strassen-2D,  one  obtains  the  new  algorithm 
we  call  Strassen-2.5D.  First  one  takes  i  DFS  steps  of  Strassen,  which  can  be  done  without 
communication,  and  then  one  applies  the  2.5D  algorithm  to  each  of  the  7f  subproblems.  The 
computational  cost  is  exactly  the  same  as  Strassen-2D,  but  the  communication  cost  will 
typically  be  lower.  Each  of  the  7(  subproblems  is  multiplication  of  n/ 2*  x  n/2e  matrices.  Each 
subproblem  uses  only  l/4£  as  much  memory  as  the  original  problem.  Thus  there  may  be  a 
large  amount  of  extra  memory  available  for  each  subproblem,  and  the  lower  communication 
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costs  of  the  2.5D  algorithm  help.  The  choice  of  I  that  minimizes  the  bandwidth  cost  is 

£opt  =  max  {o,  |k)g2  M1/2pl/3j  }  • 

2 

The  same  choice  minimizes  the  latency  cost.  Note  that  when  M  >  -^§73,  taking  zero  Strassen 
steps  minimizes  the  communication  within  the  constraints  of  the  St-rassen-2.5D  algorithm. 
With  I  =  £opt,  the  bandwidth  cost  is  a  factor  of  P1_a; 0/3  P0-064  aPoyg  that,  0f  CAPS. 

Additionally,  the  computational  cost  is  not  optimal,  and  using  I  =  £opt,  the  computational 
cost  exceeds  the  optimal  by  a  factor  of  p1-UJo/3^/j3/2-u0/2  ^  po.o64^o.o96_ 

It  is  also  possible  to  take  I  >  £opt  steps  of  Strassen  to  decrease  the  computational 
cost  further.  However  the  decreased  computational  cost  comes  at  the  expense  of  higher 
communication  cost,  as  in  the  case  of  Strassen-2D.  In  particular,  each  extra  step  over  £opt 
increases  the  bandwidth  cost  by  a  factor  of  7  and  the  latency  cost  by  a  factor  of  7.  As 
with  Strassen-2D,  it  is  possible  to  use  extra  memory  to  pack  together  messages  from  several 
subproblems  and  decrease  the  latency  cost,  but  not  the  bandwidth  cost. 


2.5  Performance  Results 


We  have  implemented  CAPS  using  MPI  on  three  supercomputers,  a  Cray  XE6  (Hopper2),  an 
IBM  BG/P  (Intrepid3),  and  a  Cray  XT4  (Franklin4),  and  we  compare  it  to  various  previous 
classical  and  Strassen-based  algorithms.  All  our  experiments  are  in  double  precision  on 
random  input  matrices.  CAPS  performs  less  communication  than  communication-optimal 
classical  algorithms,  and  much  less  than  previous  Strassen-based  algorithms.  As  a  result  it 
outperforms  all  classical  algorithms,  both  on  large  problems  (because  of  the  lower  flop  count 
of  Strassen)  and  on  small  problems  scaled  up  to  many  processors  (which  are  communication 
bound,  so  the  lower  communication  costs  of  CAPS  make  it  superior).  It  also  outperforms 
previous  Strassen-based  algorithms  because  of  its  lower  communication  costs. 

Effective  performance  is  a  useful  construct  for  comparing  classical  and  fast  matrix  mul¬ 
tiplication  algorithms.  It  is  the  performance,  normalized  with  respect  to  the  arithmetic 
complexity  of  classical  matrix  multiplication,  2 n3: 


Effective  flop/s 


2  n3 

Execution  time  in  seconds 


For  classical  algorithms,  this  gives  exactly  the  flop  rate.  For  fast  matrix  multiplication 
algorithms  it  gives  the  relative  performance,  but  does  not  accurately  represent  the  number  of 
floating  point  operations  performed. 

2For  machine  details,  see  http://www.nersc.gov/users/computational-systems/hopper 

3For  machine  details,  see  http://www.alcf.anl.gov/intrepid 

4For  machine  details,  see  http://www.nersc.gov/users/computational-systems/retired-systems/ 
franklin 
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For  each  of  the  three  machines,  we  present  two  types  of  plots.  First,  in  Figures  2.4a, 
2.4c,  and  2.4e,  we  show  strong  scaling  plots  for  a  fixed,  large  matrix  dimension  where  the 
x-axis  corresponds  to  number  of  processor  cores  (on  a  log  scale)  and  the  y-axis  corresponds 
to  fraction  of  peak  performance,  as  measured  by  the  effective  performance.  Horizontal  lines 
in  the  plots  correspond  to  perfect  strong  scaling. 

There  are  general  trends  for  all  algorithms  presented  in  these  plots.  On  the  left  side  of 
the  plots,  the  number  of  processors  is  small  enough  such  that  the  input  and  output  matrices 
nearly  fill  the  memories  of  the  processors.  As  the  number  of  processors  increases,  both 
2.5D  and  CAPS  can  exhibit  perfect  strong  scaling  within  limited  ranges.  We  demarcate  the 
strong-scaling  range  of  CAPS  as  defined  in  Section  1.1  with  a  shaded  region.  The  slightly 
larger  strong-scaling  range  for  the  classical  2.5D  algorithm  is  also  shown.  To  the  right  of  the 
strong  scaling  range,  CAPS  must  begin  to  lose  performance,  as  per-processor  communication 
no  longer  scales  with  1/P.  While  CAPS  performance  should  theoretically  degrade  more 
slowly  than  classical  algorithms,  network  resource  contention  can  also  be  a  limiting  factor. 
The  pattern  of  BFS  and  DFS  steps  used  by  CAPS  for  these  benchmarks  is  shown  in  Table  2.2 
when  the  number  of  MP1  processes  is  a  power  of  7. 

Second,  we  show  execution  time  for  fixed,  small  matrix  dimension  over  an  increasing 
number  of  processors.  See  Figures  2.4b,  2.4d,  and  2.4f.  For  these  problem  sizes,  the  execution 
time  is  dominated  by  communication,  and  the  speedup  relative  to  classical  algorithms  is  based 
primarily  on  decreases  in  communication.  The  optimal  number  of  processors  to  minimize 
time  to  solution  varies  for  each  implementation  and  machine.  These  plots  do  not  show  strong 
scaling  ranges.  For  both  2.5D  and  CAPS  if  a  problem  fits  on  one  processor,  that  is  Pm\n  =  1, 
then  for  2.5D  Pmax  =  Pb®  =  1  and  for  CAPS  Pmax  =  P^'°(2  =  1,  which  means  that  there  is 
no  strong  scaling  range. 

Note  that  because  several  of  the  implementations,  including  CAPS,  are  prototypes,  each 
has  its  own  requirement  on  the  matrix  size  n  and  the  number  of  MP1  processes  P.  We  have 
arranged  for  all  algorithms  in  a  given  plot  to  use  the  same  value  of  n,  but  the  values  of  P 
usually  do  not  match  between  algorithms.  In  both  types  of  plots,  we  are  comparing  CAPS 
performance  with  the  best  classical  implementations  and  previous  Strassen-based  approaches. 
To  simplify  the  plots,  we  omit  the  performance  of  algorithms  that  are  dominated  by  other 
similar  algorithms. 

2.5.1  Cray  XE6  Hopper 

Hopper  is  a  Cray  XE6  at  the  National  Energy  Research  Scientific  Computing  Center,  ft 
consists  of  6,384  compute  nodes,  each  of  which  has  2  twelve-core  AMD  “MagnyCours”  2.1 
GHz  processors,  and  32  GB  of  DRAM  (384  of  the  nodes  have  64  GB  of  DRAM).  The  24  cores 
are  divided  between  4  NUMA  regions.  Parallelism  between  the  6  cores  in  a  NUMA  region 
comes  from  the  threaded  BLAS  implementation  in  Cray’s  LibSci  version  11.0.05.  Hopper’s 
peak  double  precision  rate  is  50.4  Gflop/s  per  NUMA  region  or  1.28  Pflop/s  for  the  entire 
machine.  As  of  November  2011,  it  was  ranked  number  8  on  the  TOP500  list  [53],  with  a 
LINPACK  score  of  1.05  Tflop/s  on  a  matrix  of  dimension  about  4.5  million. 


Effective  Performance,  Fraction  of  Peak  Effective  Performance,  Fraction  of  Peak  Effective  Performance,  Fraction  of  Peak 
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(a)  Hopper  (Cray  XE6),  n  =  131712 


(c)  Intrepid  (IBM  BG/P),  n  =  65856 


(e)  Franklin  (Cray  XT4),  n  =  94080 


CAPS  — » —  2D-Strassen  • 

2.5D-Strassen  — ■—  Strassen-2D  * 


(b)  Hopper  (Cray  XE6),  n  =  4704 


(d)  Intrepid  (IBM  BG/P),  n  =  4704 


(f)  Franklin  (Cray  XT4),  n  =  3136 


2.5D  — B— 


Classical  Peak 


Figure  2.4:  Strong  scaling  results  on  three  machines.  Left  column:  effective  performance  on 
large  matrices  (up  is  good).  Right  column:  execution  time  on  small  matrices  (down  is  good). 
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Machine,  n 

P 

cores 

pattern 

memory  use  per  process 

Hopper,  131712 

343 

2058 

D,B,B,B,D,D,D 

5615  MB 

Hopper,  131712 

2401 

14406 

B,B,B,B,D,D,D 

4816  MB 

Intrepid,  65856 

343 

1372 

D,B,B,B 

1319  MB 

Intrepid,  65856 

2401 

9604 

B,B,B,B 

1119  MB 

Intrepid,  65856 

16807 

67228 

B,B,B,B,B 

289  MB 

Franklin,  94080 

49 

196 

D,D,B,B,D,D 

6819  MB 

Franklin,  94080 

343 

1372 

B,B,D,B,D,D 

5902  MB 

Franklin,  94080 

2401 

9604 

B,B,B,B,D,D 

2449  MB 

Table  2.2:  The  pattern  of  BFS  and  DFS  steps  and  memory  usage  for  CAPS. 


CAPS  outperforms  all  of  the  previous  algorithms.  For  the  large  problem  (n  =  131712),  it 
attains  performance  as  high  as  30%  above  the  peak  for  classical  matrix  multiplication,  83% 
above  2D,  and  75%  above  Strassen-2D.  On  this  machine,  we  benchmark  ScaLAPACK/PBLAS 
(part  of  Cray’s  LibSci  version  11.0.03)  as  the  2D  algorithm.  Since  we  were  not  able  to  modify 
that  code,  the  2D-Strassen  numbers  are  simulated  based  on  single-node  benchmarks  of  the 
corresponding  local  matrix  multiplication  size.  We  tested  the  2.5D  code  tuned  for  Intrepid 
on  Hopper;  for  large  problems  it  performed  worse  than  ScaLAPACK,  and  since  it  was  not 
tuned  for  Hopper,  we  do  not  show  results  for  2.5D,  Strassen-2.5D,  or  2.5D-Strassen.  We 
would  expect  that  2.5D  code,  properly  tuned  for  Hopper,  would  outperform  2D.  For  the  small 
problem  (n  =  4704),  we  observed  speedups  of  up  to  66%  over  2.5D,  which  happened  to  be 
the  best  of  the  other  algorithms  for  this  problem  size. 

2.5.2  IBM  BlueGene/P  Intrepid 

Intrepid  is  an  IBM  BG/P  at  the  Argonnc  Leadership  Computing  Facility.  It  consists  of 
40,960  compute  nodes,  each  of  which  has  a  quad-core  IBM  PowerPC  450  850  MHz  processor, 
and  2  GB  of  DRAM.  Intrepid’s  peak  double  precision  rate  is  13.6  Gflop/s  per  node,  or  557 
Tflop/s  for  the  entire  machine.  We  obtain  on-node  parallelism  using  the  threaded  BLAS 
implementation  in  IBM’s  ESSL  version  4.4. 1-0.  As  of  November  2011,  it  was  ranked  number 
23  on  the  TOP500  list  [53],  with  a  LINPACK  score  of  459  Tflop/s.  Intrepid  allows  allocations 
only  in  powers  of  two  nodes  (with  a  few  exceptions),  but  in  our  performance  data  we  count 
only  the  nodes  we  use. 

On  Intrepid,  the  most  efficient  classical  code  is  2.5D  and  is  well-tuned  to  the  architecture. 
It  consistently  outperforms  Strassen-2D  and  Strassen-2.5D,  so  we  omit  those  algorithms  in  the 
performance  plots.  The  2D  and  2.5D  code  are  from  [61].  For  the  large  problem  (n  =  65856), 
CAPS  achieves  a  speedup  of  up  to  57%  over  2.5D  or  2.5D-St.rassen;  for  the  small  problem 
(n  =  4704),  the  best  speedup  is  12%. 
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2.5.3  Cray  XT4  Franklin 

Franklin  is  a  recently  retired  Cray  XT4  at  the  National  Energy  Research  Scientific  Computing 
Center.  It  consists  of  9,572  compute  nodes,  each  of  which  has  a  quad-core  AMD  “Budapest” 

2.3  GHz  processor,  and  8  GB  of  DRAM.  Franklin’s  peak  double  precision  rate  is  36.8  Gflop/s 
per  node,  or  352  Tflop/s  for  the  entire  machine.  On  each  node,  we  use  the  threaded  BLAS 
implementation  in  Cray’s  LibSci  version  10.5.02.  As  of  November  2011,  it  was  ranked  number 
38  on  the  TOP500  list  [53],  with  a  LINPACK  score  of  266  Tflop/s  on  a  matrix  of  dimension 
about  1.6  million. 

CAPS  outperforms  all  of  the  previous  algorithms,  and  attains  performance  as  high  as 
33%  above  the  theoretical  maximum  for  classical  algorithms,  as  shown  in  Figure  2.4e.  The 
largest  speedups  we  observed  for  the  large  problem  (n  =  94080)  was  103%  faster  than 
2.5D,  the  fastest  classical  algorithm,  and  187%  faster  than  Strassen-2D,  the  best  previous 
Strassen-based  algorithm.  For  the  small  problem  size  (n  =  3136),  we  observed  up  to  84% 
improvement  over  2.5D,  which  was  the  best  among  the  all  other  approaches.  The  2D  and 
2.5D  code  are  from  [61]. 

For  a  matrix  dimension  of  n  =  188160,  we  observed  an  aggregate  effective  performance 
rate  of  351  Tflop/s  which  exceeds  the  LINPACK  score.  Note  that  for  this  run  CAPS  used 
only  7203  (75%)  of  the  nodes  and  a  matrix  of  less  than  one  eighth  the  dimension  used  for 
the  TOP500  number.  In  fact,  increasing  the  matrix  size  to  n  =  263424  increases  its  effective 
performance  to  388  Tflop/s,  higher  than  Franklin’s  theoretical  peak  for  classical  algorithms. 

2.5.4  CAPS  vs.  Strassen-based  algorithms 

Figure  2.5  compares  the  performance  of  CAPS  with  the  previous  Strassen-based  approaches 
on  Intrepid.  The  plot  shows,  for  a  fixed  matrix  dimension  and  number  of  processors,  both  the 
effective  and  actual  performance  of  the  two  previous  Strassen-based  algorithms  and  CAPS 
over  various  numbers  of  Strassen  steps.  For  a  given  number  of  Strassen  steps,  the  three 
algorithms  do  (almost)  the  same  number  of  flops.  Note  that  since  the  number  of  nodes  is  49, 
CAPS  is  defined  only  for  at  least  2  Strassen  steps. 

For  this  matrix  dimension,  CAPS  attains  highest  effective  performance  (shortest  time 
to  completion)  at  4  Strassen  steps.  We  see  that  the  actual  performance  for  CAPS  (and  the 
other  two  algorithms)  decreases  with  the  number  of  Strassen  steps,  as  it  becomes  harder  to 
do  the  fewer  flops  as  efficiently. 

In  the  case  of  2D-Strassen,  varying  the  number  of  Strassen  steps  means  varying  how  each 
local  matrix  multiplication  is  performed.  For  the  local  matrix  dimension  of  n  =  3136,  two 
Strassen  steps  is  optimal,  and  the  improvement  in  effective  performance  is  modest  because 
the  matrix  dimension  is  fairly  small.  In  the  case  of  Strassen-2D,  both  effective  and  actual 
performance  degrade  with  each  Strassen  step.  This  is  due  to  the  increasing  communication 
costs  of  the  algorithm,  which  outweigh  the  computational  savings. 
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Figure  2.5:  Efficiency  at  various  numbers  of  Strassen  steps,  n  =  21952,  on  49  nodes  (196 
cores)  of  Intrepid.  Effective  performance  is  relative  to  2n3  flops  and  actual  performance  is 
relative  to  the  actual  number  of  flops  performed. 


2.6  Performance  Model 

In  this  section,  we  introduce  a  performance  model  in  order  to  predict  performance  on  a 
distributed-memory  parallel  machine.  We  include  a  single-node  performance  model  to  more 
accurately  represent  local  computation.  The  main  goals  of  the  performance  model  are  to 
validate  the  theoretical  analysis  of  CAPS  to  real  performance,  identify  areas  which  might 
benefit  from  further  optimization,  and  make  predictions  for  performance  on  future  hardware. 

We  choose  to  validate  our  model  on  Intrepid  because  its  performance  is  very  consistent 
(usually  less  than  1%  variation  in  execution  time,  versus  10-20%  on  Hopper)  and  also 
because  we  believe  there  is  opportunity  for  topology-aware  optimizations,  which  we  discuss 
in  Section  2.6.4. 

2.6.1  Single  Node 

Due  to  the  sensitivity  of  Strassen  performance  to  DGEMM  performance  and  the  difficulty  of 
modeling  DGEMM  performance  accurately  for  small  problems,  we  use  a  third  degree  polynomial 
of  best  fit  to  match  the  measured  time  of  ESSL’s  implementation  of  the  classical  algorithm 
(DGEMM).  Besides  making  calls  to  DGEMM,  Strassen’s  algorithm  consists  of  performing  matrix 
additions  which  are  communication  bound.  Thus,  we  measured  the  time  of  DAXPY  per  scalar 
addition,  which  is  fairly  independent  of  matrix  size. 

Let  TDGEHM(n)  be  the  polynomial  for  the  time  cost  of  classical  matrix  multiplication  of 
dimension  n  and  TDAXPY  be  the  cost  per  scalar  addition  for  large  vectors.  We  obtain  the  single 
node  performance  model  for  the  time  cost  of  Strassen’s  algorithm  using  s  steps  of  Strassen 
on  a  problem  of  size  n  as 
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Figure  2.6:  Comparison  of  the  sequential  model  to  the  actual  performance  of  classical  and 
Strassen  matrix  multiplication  on  four  cores  (one  node)  of  Intrepid. 


Tseq(n)  —  min  I  7s  ■  TDGEfffl  ^  9  •  TD  AXPY  •  ^4^  (^-6) 

The  constant  9  comes  from  the  fact  that  in  Strassen-Winograd,  for  each  of  A  and  B ,  four 
sub-matrices  must  be  read  and  four  written  (since  three  outputs  are  copies  of  inputs),  and  to 
compute  C ,  seven  input  matrices  must  be  read  and  four  written;  whereas  TDAXPY  is  essentially 
the  time  to  read  two  words  and  write  one  word.  Alternately,  one  can  make  15  calls  to 
DAXPY,  one  for  each  matrix  addition,  which  yields  a  constant  of  15  but  allows  the  use  of  a 
tuned  subroutine.  We  found  better  performance  using  DAXPY  on  Intrepid,  but  with  enough 
optimization,  an  implementation  based  on  the  first  approach  should  be  more  efficient. 

The  parameters  of  our  single  node  model  (in  seconds)  are:  TDGEMM(n)  =  2.04  •  10~10n3  + 
2.14  •  10_8n2  -  4.18  •  10"6n  +  2.11  •  10"3  and  TDAXPY  =  3.66  •  1(T9. 

We  present  actual  and  modeled  performance  of  both  classical  and  Strassen  performance 
on  a  single  node  in  Figure  2.6.  Note  that  the  classical  model  is  nearly  indistinguishable  from 
the  data  in  the  plot  because  it  is  a  curve  of  best  fit.  By  minimizing  over  s,  the  model  from 
Equation  (2.6)  chooses  the  optimal  cutoff  point  (around  n  =  1000)  to  switch  to  the  classical 
algorithm,  and  the  performance  of  Strassen  matches  the  classical  algorithm  below  that  point. 

I11  Figure  2.7  we  show  a  breakdown  of  time  between  additions  and  multiplications  (calls 
to  DGEMM)  for  both  the  model  and  the  actual  implementation.  For  this  problem  size,  the 
optimal  number  of  Strassen  steps  is  2,  where  the  time  is  almost  completely  dominated  by  the 
multiplications.  Note  that  the  model  predicts  better  performance  for  the  additions  than  the 
implementation  achieves,  but  the  main  determining  factor  for  optimal  number  of  Strassen 
steps  is  the  performance  of  DGEMM  for  the  different  problem  sizes. 
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Figure  2.7:  Time  breakdown  comparison  between  the  sequential  model  and  the  data  for 
n  =  4097.  Both  model  and  data  times  are  normalized  to  the  modeled  classical  algorithm 
time. 


2.6.2  Distributed  Machine 


We  start  with  the  conventional  (a,/3, 7)  performance  model  for  a  distributed- memory  parallel 
algorithm  which  uses  three  machine  parameters:  a  as  the  latency  between  any  two  nodes, 
(3  as  the  inverse  bandwidth  between  any  two  nodes,  and  7  as  the  time  cost  per  flop  on  a 
single  node  [36,  10,  45].  By  counting  flops  /,  words  w,  and  messages  m  along  the  critical 
path  and  summing  up  the  three  terms  with  corresponding  coefficients,  one  can  model  the 
time  cost  of  a  parallel  algorithm  as  am  +  /3w  +  7 /.  The  main  shortcomings  of  this  model  are 
that  it  assumes  an  all-to-all  network  (thus  ignoring  contention  among  processors  for  network 
links  and  the  number  of  hops  a  message  must  take),  it  ignores  overlap  of  computation  and 
communication,  and  it  assumes  the  cost  per  flop  is  constant  on  a  node  (ignoring  on-node 
communication  costs). 

To  overcome  the  third  shortcoming,  we  modify  the  (a,  /3, 7)  model  by  replacing  the  7  term 
with  the  single  node  model  for  the  local  multiplications  (which  may  include  more  Strassen 
steps)  and  using  the  measured  TDAXpy  for  the  time  cost  of  each  scalar  addition  during  the 
parallel  Strassen  steps.  Then  the  time  spent  on  computation  is 


(2.7) 


where  k  =  log7  P  is  the  number  of  BFS  steps  taken,  and 
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is  the  number  of  DFS  steps  necessary  to  fit  in  the  available  memory.  Note  that  the  computation 
is  perfectly  load  balanced  so  that  Tf(n,  P)  —  ^  ■  Tseq(n).  In  the  model  we  allow  k  and  I  to 
be  real  valued  to  give  a  continuous  function,  even  though  the  algorithm  only  makes  sense  for 
integral  values. 

The  number  of  words  and  messages  are  exactly  as  in  [6]: 


w  = 


m  =  7e36k. 


The  distributed  model  is  then: 

T(n,  P )  =  7)(n,  P)  +  7£36 ka  +  Q  p. 

The  parameters  of  the  distributed  model  are  P  —  2.13  •  10-8  and  a  =  2  •  10-6,  measured  in 
seconds. 

We  present  actual  and  modeled  strong  scaling  performance  of  CAPS,  2D  and  2.5D 
in  Figure  2.8  (see  Appendix  A  in  [60]  for  the  classical  performance  model).  The  CAPS 
performance  and  model  match  quite  well  up  to  about  4116  cores,  but  for  runs  on  more 
cores  the  actual  performance  drops  significantly  below  the  predictions  of  the  model.  We 
believe  this  is  due  to  contention;  we  consider  optimizing  CAPS  to  a  3D-torus  network  in 
Section  2.6.4.  The  model  also  allows  us  to  break  down  the  time  into  communication  time 
(the  a  and  P  terms),  time  spent  in  calls  to  DGEMM  (the  first  term  in  Equation  2.7),  and  time 
spent  in  additions  (the  second  term  in  Equation  2.7).  We  compare  these  times  to  the  actual 
time  breakdown  (averaged  over  processors)  in  Figure  2.9.  The  model  works  well  for  small 
values  of  P,  but  understates  the  communication  cost  for  large  values  of  P,  due  to  contention. 
In  fact,  at  P=49,  the  communication  is  slightly  faster  than  predicted  by  the  model,  which  is 
possible  because  the  model  counts  bandwidth  along  only  one  direction  on  one  of  the  six  links 
to  a  given  node,  and  ignores  communication  hiding. 


2.6.3  Exascale  Predictions 

We  model  performance  on  a  hypothetical  exascale  machine  by  counting  words  communicated 
on  the  network,  words  transferred  between  DRAM  and  cache,  and  flops  computed  per 
processor.  For  the  machine  parameters,  we  use  values  from  the  2018  Swimlane  1  extrapolation 
in  [58]: 


Number  of  nodes 

22° 

Flop  rate  per  node 

1  Tflop/s 

Cache  size  per  node 

512  MB 

DRAM  per  node 

32  GB 

Node  memory  bandwidth 

0.4  TB/s 

Network  link  bandwidth 

20  GB/s 

Network  latency 

1  ns 
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Figure  2.8:  Comparison  of  the  parallel  model  with  the  algorithms  in  strong  scaling  of 
n  =  65856  on  Intrepid. 
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Figure  2.9:  Time  breakdown  comparison  between  the  parallel  model  and  the  data.  In  each 
case  the  entire  modeled  execution  time  is  normalized  to  1. 


The  projected  speedups  of  CAPS  over  2.5D  and  2D  are  shown  in  Figure  2.10.  The  horizontal 
scale  is  the  (log  of  the)  number  of  nodes,  and  the  vertical  scale  is  the  (log  of  the)  amount 
of  memory  per  node  used  to  store  a  single  matrix.  Thus  moving  horizontally  in  the  plot 
corresponds  to  weak  scaling,  and  moving  diagonally  downward  corresponds  to  strong  scaling. 
Compared  to  2.5D,  our  largest  speedup  is  5.45x  at  the  top-right  of  the  plot:  very  large 
matrices  run  using  the  entire  machine.  Although  CAPS  communicates  asymptotically  less 
than  2.5D,  the  advantage  is  very  slight,  and  the  constants  for  CAPS  are  larger  than  for  2.5D 
in  our  model.  For  small  problems  (bottom  of  the  figure),  CAPS  is  slightly  faster  when  using 
the  entire  machine  but  slower  for  fewer  processors.  Comparing  to  2D,  which  communicates 
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Figure  2.10:  Predicted  speedups  of  CAPS  over  2.5D  (top)  and  2D  (bottom)  on  an  exascale 
machine. 


much  more  for  small  problems,  there  are  substantial  speedups  of  5.45 x  in  the  top  right,  and 
5.27x  in  the  communication-bound  regime  at  the  bottom  right. 

2.6.4  Areas  of  possible  performance  improvement 

Based  on  our  performance  models  and  benchmarks,  we  believe  there  are  several  areas  in  which 
further  performance  optimizations  will  be  effective.  First,  since  local  computation  dominates 
the  execution  time  for  many  problems,  improving  the  on-node  performance  of  Strassen  can 
help  overall.  By  writing  more  efficient  addition  code  which  exploits  the  shared  operands 
and  decreases  reads  from  DRAM,  we  believe  it  is  possible  to  match  our  modeled  on-node 
performance  (an  improvement  of  around  10%).  Further  improving  the  performance  of  DGEMM 
for  small  problems  would  also  boost  on-node  Strassen  performance.  If  the  performance  curve 
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for  the  classical  algorithm  reaches  its  peak  for  smaller  matrices,  then  the  cutoff  point  can 
be  decreased;  more  Strassen  steps  implies  greater  computational  savings,  so  the  effective 
performance  will  be  improved  for  large  matrices  (using  one  more  Strassen  step  can  improve 
performance  up  to  about  14%  since  it  only  does  7  instead  of  8  multiplications  on  matrices  of 
half  the  size). 

Second,  we  believe  there  are  important  topology-aware  optimization  possibilities.  On 
Intrepid,  where  the  topology  is  known,  one  can  map  processors  to  nodes  in  order  to  minimize 
contention  and  also  maximize  the  use  of  a  node’s  links  in  each  of  the  three  dimensions,  as 
in  [61].  In  most  cases  we  achieve  the  best  performance  by  laying  out  7  processes  onto  7  of 
the  8  nodes  in  a  2  x  2  x  2  cube,  and  then  recursively  using  this  layout  for  higher  powers 
of  7.  Another  natural  mapping  is  to  place  the  7k  processes  in  a  /c-dimensional  grid  so  that 
the  communication  occurs  only  in  disjoint  pencils  (lines  of  7  axis-aligned  processors).  The 
contention  will  then  never  be  worse  than  for  7  processors  communicating  around  a  ring, 
although  only  1/k  of  the  links  will  be  active  at  any  time.  On  Intrepid  this  works  for  k  <  3 
since  it  has  a  3-dimensional  topology  (k  =  3  implies  1372  =  4  •  73  cores). 

A  more  systematic  approach  of  finding  optimal  mappings  may  yield  significant  improve¬ 
ments.  Avoiding  contention  completely  would  enable  performance  to  match  the  performance 
model  (an  improvement  of  around  30%  for  large  P).  Since  the  model  is  based  on  one  link’s 
bandwidth,  optimizing  the  mapping  to  take  advantage  of  multiple  links  can  yield  performance 
which  exceeds  the  model.  For  small  matrices  and  communication-bound  problems,  this  can 
lead  to  significant  performance  improvements. 

Our  implementation  is  somewhat  sensitive  to  matrix  dimension  and  number  of  processors. 
There  are  many  optimizations  which  could  help  smooth  the  performance  curve  for  arbitrary 
n  and  P  which  we  did  not  consider  in  this  work. 


2.7  Implementation  Details 

The  implementation  of  CAPS  follows  the  algorithm  presented  in  Section  2.3.  This  section 
fills  in  several  of  the  details  of  the  implementation. 

2.7.1  Base-case  Multiplication 

Until  now,  we  have  assumed  that  Strassen  will  be  performed  all  the  way  down  to  1  x  1 
matrices.  In  practice,  however,  it  is  faster  to  use  a  classical  matrix  multiplication  algorithm  on 
sizes  below  some  threshold.  See  section  2.6  for  a  discussion  of  what  constitutes  a  reasonable 
threshold. 

So  that  the  base-case  doesn’t  increase  the  communication  cost,  we  demand  that  all 
base-case  matrix  multiplies  are  local  calls  to  DGEMM.  This  means  that  we  must  take  exactly 
k  =  log7  P  BFS  steps,  which  requires  that  there  are  at  least  k  Strassen  steps.  This  requirement 
is  not  very  demanding,  and  in  practice  any  matrix  size  that  scales  well  to  P  processors  will 
benefit  from  more  than  k  Strassen  steps.  We  count  on  the  DGEMM  implementation  to  provide 
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good  shared-memory  parallel  performance,  which  is  reasonable  on  today’s  computers,  but 
may  not  be  as  the  number  of  cores  on  each  node  increases. 

2.7.2  Running  on  P  =  m  ■  7k  Processors 

In  this  section,  we  generalize  the  assumption  that  the  number  of  processors  is  exactly  a  power 
of  7.  This  assumption  is  not  realistic  in  practice,  and  if  we  set  P  to  be  the  largest  power  of  7 
no  larger  than  a  given  allocation,  we  might  lose  up  to  a  factor  of  7  in  performance,  making 
Strassen  slower  than  classical  matrix  multiplication  in  many  cases.  As  shown  in  Section  2.7.3, 
making  the  algorithm  more  practical  and  capable  of  running  on  m  ■  7k  processors  does  not 
sacrifice  the  theoretical  communication  optimality.  Figure  2.4  shows  that  actual  performances 
with  these  generalizations  is  comparable. 

If  we  take  P  =  m  ■  7k,  then  after  k  BFS  steps  (and  perhaps  some  DFS  steps  so  there 
is  enough  memory),  the  problem  is  reduced  to  multiplying  smaller  matrices  on  P  —  m 
processors.  We  have  implemented  two  schemes  for  Strassen  in  such  cases:  perform  either  DFS 
steps  or  what  we  call  hybrid  steps,  followed  by  a  distributed  classical  matrix  multiplication  at 
the  base  case.  In  our  implementation  the  classical  multiplication  uses  a  ID  processor  grid, 
which  performs  well  for  small  m. 

Using  only  DFS  steps,  the  number  of  words  communicated  grows  by  a  factor  of  7/4 
for  every  DFS  step.  If  more  than  one  or  two  DFS  steps  are  taken  the  increase  in  the 
communication  cost  can  be  too  large.  If  too  few  Strassen  steps  are  taken  we  may  miss  the 
arithmetic  savings  that  they  can  provide.  The  situation  is  analogous  to  that  of  2D-Strassen. 

The  alternative  is  a  hybrid  step  on  1  <  m  <  7  processors.  In  a  hybrid  step,  the  7  matrix 
multiplies  of  a  Strassen  step  are  performed  locally  in  groups  of  m,  and  any  leftovers  are  run 
on  all  m  processors.  For  example  if  m  —  2  then  3  of  the  7  multiplications  are  performed 
locally  on  each  processor,  and  the  remaining  one  is  performed  on  both  processors.  Using 
hybrid  steps  recursively,  most  of  the  subproblems  are  computed  locally  by  one  processor,  and 
so  there  is  a  lower  communication  cost. 

In  practice,  the  choice  between  hybrid  steps  and  DFS  steps  on  m  processors  is  best- 
regarded  as  a  tuning  parameter.  Hybrid  steps  are  provably  optimal  (see  Section  2.7.3),  but 
the  extra  communication  from  DFS  steps  overlaps  more  easily  with  the  calls  to  DGEMM  (see 
Section  2.7.6). 

2.7.3  Optimality  of  hybrid  steps 

In  this  section  we  prove  that  CAPS  running  on  P  =  m  ■  7k  using  hybrid  steps  (as  defined 
in  Section  2.7.2)  is  communication  optimal,  up  to  a  constant  factor,  if  m  is  regarded  as  a 
constant.  Given  the  optimality  of  CAPS  using  BFS  and  DFS  steps  proved  in  [6],  we  need 
only  consider  the  case  P  =  m. 

Theorem  2.4.  Performing  Strasseris  matrix  multiplication  using  hybrid  steps  on  m  = 
2,  3, 4,  5,  6  processors  communicates  0(n2)  words  and  requires  0(n2)  memory.  Combined  with 
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the  lower  bounds  of  [9],  this  shows  that  the  algorithm  is  communication  optimal. 

Proof.  The  bandwidth  cost  recurrence  for  a  hybrid  step  on  m  processors  is  W(n,m )  = 
0(n2)  +  ( 7  —  m  )  W  (| ,  m) ,  where  the  hrst  term  is  the  words  communicated  to  redistribute 
the  hrst  rn [7 frn\  subproblems  to  the  m  processors,  and  the  second  term  is  the  words  required 
to  compute  the  remaining  subproblems  in  parallel.  Note  that  for  m  =  2,3,4,  5,6,  we  have 
7  —  m\7/m\  <  4,  and  so  the  solution  to  this  recurrence  is  W(n,m )  =  0(n2).  Further,  the 
extra  memory  used  by  the  algorithm  is  simply  the  amount  of  memory  used  to  store  the  data 
each  processor  receives,  and  so  the  memory  usage  is  also  M  =  0(n2).  □ 

It  is  similarly  possible  to  show  that  the  algorithm  is  communication-optimal,  up  to  a 
constant  factor,  for  any  constant  m  that  has  prime  factors  2,  3,  5,  by  recursively  using  hybrid 
steps.  However  the  constant  in  the  communication  cost  grows  with  m,  and  so  it  is  probably 
not  practical  to  run  on  P  =  m  ■  7k  processors  for  m  much  larger  than  6.  In  general,  given  a 
number  of  processors  that  is  not  a  power  of  7,  the  choice  of  how  many  processors  to  ignore 
and  how  large  to  allow  m  to  be  is  left  to  tuning. 

2.7.4  Interleaving  BFS  and  DFS  steps 

As  argued  in  [6],  it  is  possible  to  achieve  the  bandwidth  lower  bound,  up  to  a  constant  factor, 
using  only  a  simple  scheme  of  I  DFS  steps,  followed  by  k  —  log7  P  BFS  steps,  followed  by 
local  Strassen.  Our  implementation  allows  arbitrary  interleaving  of  BFS  and  DFS  steps, 
which  in  some  cases  provides  a  reduction  in  the  bandwidth  costs.  Computation  of  the  optimal 
interleaving  patterns  can  be  done  once,  offline,  for  each  value  of  k. 

For  example,  when  running  on  16807  =  75  processors,  the  simple  interleaving  patterns  are 
all  optimal  for  certain  memory  sizes.  However  for  intermediate  memory  sizes  it  is  possible  to 
reduce  the  volume  of  communication  by  up  to  about  25%  by  choosing  a  different  interleaving; 
see  Figure  2.11. 

2.7.5  Data  Layout 

The  data  layout  can  be  naturally  divided  into  two  levels:  the  global  data  layout  specifies 
which  process  owns  each  part  of  each  matrix,  and  the  local  data  layout  specifies  in  what 
order  the  data  is  stored  in  the  local  memory  of  a  given  process.  For  the  global  layout  with  7k 
processors,  we  use  a  cyclic  distribution  with  a  processor  grid  of  size  7 Lfc/2J  x  7^’/2^.  Note  that 
this  satisfies  the  properties  given  in  Section  3.2  of  [6].  Thus  the  communication  cost  analysis 
given  there  holds  no  matter  what  choice  we  make  for  the  local  data  layout.  Additionally, 
transformations  between  different  local  layouts  can  be  done  quickly  and  without  any  inter¬ 
processor  communication.  The  local  layout  we  choose  is  that  blocks  of  size  2a7yk/2\  x  2  pfc/21 
are  stored  contiguously,  and  these  blocks  are  ordered  relative  to  each  other  following  recursive 
l/l-Morton  ordering  [54], 

The  entire  layout  can  also  be  thought  of  as  s  levels  of  recursive  Morton  ordering,  followed 
by  cyclic  layout  in  each  of  the  sub-matrices  of  size  We  choose  Morton  ordering  because  it 
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Figure  2.11:  The  memory  and  communication  costs  of  all  252  possible  interleavings  of  BFS 
and  DFS  steps  for  multiplying  matrices  of  size  n=351232  on  P— 16807  processors  using  10 
Strassen  steps.  The  optimal  ones  show  the  trade-off  between  memory  size  and  communication 
cost.  Simple  interleavings  are  those  for  which  all  k  BFS  steps  are  performed  as  a  block. 


Figure  2.12:  Data  layout:  s  levels  of  Morton  ordering  are  used  at  the  top  level,  block-cyclic 
distribution  is  used  at  the  bottom.  Numbers  correspond  to  processors  in  base  7. 


is  a  very  good  fit  to  Strassen’s  algorithm  both  conceptually  and  to  enhance  locality  [1].  Since 
we  choose  to  pack  messages  together  to  minimize  the  number  of  message  sent,  it  is  necessary 
to  re-order  the  data  for  each  communication  step  to  maintain  this  data  layout. 

If  the  matrix  dimension  is  a  multiple  of  2s7Lfc/2J ,  this  layout  is  the  same  as  cyclic  layout. 
Thus  it  is  compatible,  up  to  local  re-ordering,  with  the  block-cyclic  layout  of  ScaLAPACK  in 
this  case  if  the  block  size  is  one.  Moreover,  CAPS  can  work  with  any  block  size  that  divides 
2s^\k/2-\  ,  after  local  re-arrangement.  If  matrices  are  stored  in  another  layout,  re-arranging  them 
to  the  layout  that  CAPS  uses  would  only  increase  the  bandwidth  cost  of  the  multiplication 
by  a  subleading  term. 
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2.7.6  Overlapping  Computation  and  Communication 

We  attempt  to  overlap  computation  and  communication  as  long  as  it  can  be  done  without 
breaking  the  recursive  structure  of  the  algorithm.  First,  during  BFS  steps  the  additions  are 
overlapped  with  the  communication.  For  Strassen-Winograd,  6  of  the  14  factors  require  no 
computation,  so  the  additions  for  the  other  8  can  be  done  while  those  are  transferred.  We 
do  not  attempt  to  overlap  the  communication  of  the  seven  products  with  the  additions  to 
convert  them  into  entries  of  C,  because  it  is  not  clear  how  to  do  this  without  degrading  cache 
performance.  Second,  for  base-case  multiplies  with  m  >  1,  we  overlap  the  communication 
with  the  calls  to  DGEMM.  Finally,  there  is  some  overlap  in  hybrid  BFS  steps,  where  the  details 
of  how  much  we  can  overlap  depend  on  the  exact  value  of  m. 

In  principle,  it  should  be  possible  to  hide  more  of  the  communication  cost,  ideally  by 
performing  some  DGEMM  calls  during  the  communication  of  each  BFS  step.  However  these 
DGEMM  calls  only  appear  deeper  in  the  recursion  tree  of  the  algorithm,  so  to  do  this  would 
require  breaking  the  recursive  structure  of  the  algorithm  (that  is,  breaking  one  recursive  call  up 
into  several  parts,  which  are  performed  when  the  data  they  require  has  been  communicated). 


2.8  Numerical  Stability 

CAPS  has  the  same  stability  properties  as  sequential  versions  of  Strassen.  For  a  complete 
discussion  of  the  stability  of  fast  matrix  multiplication  algorithms,  see  [41,  30].  We  highlight 
a  few  main  points  here.  The  tightest  error  bounds  for  classical  matrix  multiplication 
are  component- wise:  \C  —  C\  <  ne\A\  ■  \B\,  where  C  is  the  computed  result  and  e  is 
the  machine  precision.  Strassen  and  other  fast  algorithms  do  not  satisfy  component- wise 
bounds  but  do  satisfy  the  slightly  weaker  norm-wise  bounds:  ||C  —  C ||  <  /(n)e||A|| ||H||, 
where  ||A||  =  maxy  AtJ  and  /  is  polynomial  in  n  [41],  Accuracy  can  be  improved  with 
the  use  of  diagonal  scaling  matrices:  DiCD3  =  D3AD 2  ■  D^BD?,.  It  is  possible  to  choose 
Di,D2,D3  so  that  the  error  bounds  satisfy  either  Ctj  —  CtJ \  <  f(n)e\\A(i,:)\\\\B(:,j)\\  or 
|| C  —  (7||  <  /(n)e|||A|  •  |R|||.  By  scaling,  the  error  bounds  on  Strassen  become  comparable 
to  those  of  many  other  dense  linear  algebra  algorithms,  such  as  LU  and  QR  decomposition 
[29] .  Thus  using  Strassen  for  the  matrix  multiplications  in  a  larger  computation  will  often 
not  harm  the  stability  at  all. 

Using  fewer  than  log2n  Strassen  steps  improves  the  theoretical  constant  in  the  error 
bound.  More  precisely,  using  s  Strassen  steps,  the  error  bound  for  Strassen-Winograd  given 
in  [41]  is 

||C  -  C||m«  <  (l8*  ((£)’  +  |?)  -  6")  ll-4|U«l|B||m»£ 

where  e  is  the  machine  precision  and  ||A||max  :=  maxjj  \Aij\  is  the  max-norm  of  A. 

However,  as  illustrated  in  [41],  this  theoretical  bound  is  too  pessimistic.  In  Figure  2.13, 
we  show  the  measured  max-norm  absolute  error  compared  to  the  theoretical  bound  for  a 
single  matrix  of  size  n  =  16384  in  double  precision  where  each  entry  is  chosen  uniformly  at 
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Figure  2.13:  Stability  test:  theoretical  error  bound  versus  actual  error  for  n=16384.  Zero 
Strassen  steps  corresponds  to  the  classical  algorithm.  Double  precision  machine  epsilon  is 
2.2  •  10”16. 


random  from  [—1, 1],  varying  the  number  of  Strassen  steps  taken.  For  the  “exact”  answer  we 
compute  the  product  in  quadruple  precision.  To  maximize  performance  on  a  single  node  of 
Hopper  or  Intrepid,  for  example,  the  optimal  number  of  Strassen  steps  for  n  =  16834  is  4, 
where  the  result  loses  about  two  decimal  digits  (measured  by  norm-wise  error)  compared  to 
classical  matrix  multiplication. 

2.9  Parallelizing  Other  Fast  Matrix  Multiplication 
Algorithms 

Our  approach  of  executing  a  recursive  algorithm  in  parallel  by  traversing  the  recursion 
tree  in  DFS  or  BFS  manner  is  not  limited  to  Strassen’s  algorithm.  Recursive  matrix 
multiplication  algorithms  are  typically  built  out  of  ways  to  multiply  Uq  x  rt0  matrices  using 
q  <  Uq  multiplications.  As  with  Strassen  and  Strassen-Winograd,  they  compute  q  linear 
combinations  of  entries  of  each  of  A  and  B ,  multiply  these  pairwise,  then  compute  the 
entries  of  C  as  linear  combinations  of  these.  CAPS  can  be  easily  generalized  to  any  such 
multiplication,  with  the  following  modifications: 

•  The  number  of  processors  P  is  a  power  of  q. 

•  The  data  layout  must  be  such  that  all  blocks  of  A,  B ,  and  C  are  distributed  equally 
among  the  P  processors  with  the  same  layout. 

•  The  BFS  and  DFS  determine  whether  the  q  multiplications  are  performed  simultaneously 
or  in  sequence. 

The  communication  costs  are  then  exactly  as  above,  but  with  =  logn[|  q.  One  special  case 
is  the  classical  square  recursive  algorithm  that  splits  A,  B,  and  C  each  into  4  square  blocks 
and  solves  the  problem  with  8  multiplications  of  those  blocks.  Generalizing  CAPS  to  this 
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Table  2.3:  Asymptotic  bandwidth  costs  of  classical  and  fast,  square  and  rectangular  matrix 
multiplication  in  the  case  of  unlimited  memory.  See  the  text  for  a  description  of  the  fast 
rectangular  algorithm  and  problem  size. 


algorithm  gives  a  communication-optimal  classical  square  matrix  multiplication  algorithm 
with  the  same  asymptotic  costs  as  the  2.5D  algorithm  [31]. 

Using  the  techniques  from  Theorem  3.3  of  [30],  one  can  convert  any  fast  square  matrix 
multiplication  algorithm  into  one  that  is  nearly  as  fast  and  is  of  the  form  above.  Using 
the  CAPS  parallelization  approach  this  gives  a  communication-avoiding  parallel  algorithm 
corresponding  to  any  fast  matrix  multiplication  algorithm.  We  conjecture  that  there  is  a 
matching  lower  bound,  making  these  algorithms  optimal. 

CAPS  can  also  be  generalized  to  fast  rectangular  matrix  multiplication  algorithms  (see 
[42,  14,  27,  50,  44,  43,  28]  and  further  details  in  [20]),  which  are  built  out  of  a  base  case  for 
multiplying  an  m 0  x  n0  matrix  A  with  an  n0  x  p0  matrix  B  to  obtain  an  m 0  x  p0  matrix 
C  using  only  q  <  monoPo  scalar  multiplications.  As  with  the  square  case,  a  lower  bound  is 
known  for  some  of  these  algorithms  [7],  and  its  generalization  is  a  conjecture.  Implementing 
and  benchmarking  the  BFS/DFS  approach  on  any  of  these  algorithms  remains  to  be  done. 

It  is  instructive  to  compare  the  communication  costs  of  fast  rectangular  matrix  multi¬ 
plication  with  those  of  the  communication-optimal  classical  algorithm  that  is  presented  in 
Chapter  3.  Consider  a  fast  algorithm  as  above,  with  m0  >  n0  >  p0  applied  t  times  to  multiply 
matrices  with  dimensions  m],  >  Hq  >  pf0  in  the  unlimited  memory  regime.  The  bandwidth 
costs  of  the  generalization  of  CAPS  and  of  CARMA  (see  Chapter  3)  in  this  situation  are 
shown  in  Table  2.3. 

Note  that  since  p0  <  m0,n0, 


q  <  m0n0po  <  (m0ra0)3/2, 


and  so  the  exponent  of  P  in  the  denominator  in  the  fast  rectangular  case  is  logq  mono  >2/3. 
This  means  that  for  any  given  problem,  on  sufficiently  many  processors,  any  fast  algorithm 
will  communicate  less  than  the  classical  algorithm.  On  a  small  number  of  processors,  however, 
fast  rectangular  algorithms  may  communicate  more  than  the  classical  algorithm  because 
fast  algorithms  require  the  largest  matrix  to  be  communicated,  whereas,  depending  on  the 
number  of  processors,  the  classical  algorithm  may  not. 
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Chapter  3 

Classical  Rectangular  Matrix 
Multiplication 


In  this  chapter  we  focus  on  the  classical  (three  nested  loops)  matrix  multiplication  algorithm. 
For  square  matrices,  the  CAPS  algorithm  immediately  generalizes  to  give  an  optimal  algo¬ 
rithm  as  described  in  Section  2.9.  However  for  rectangular  matrices,  there  are  additional 
complications.  Using  the  classical  algorithm,  there  are  several  ways  to  split  up  the  problem 
into  smaller  matrix  multiplications.  To  obtain  a  communication-optimal  algorithm,  it  is 
necessary  to  combine  three  of  these,  corresponding  to  splitting  each  of  the  three  dimensions, 
as  illustrated  in  Figure  3.1. 

For  the  sequential  case,  it  is  shown  in  [34]  that  splitting  the  largest  dimension  at  each 
recursive  step  asymptotically  minimizes  the  communication  costs.  We  apply  the  BFS/DFS 
approach  to  the  dimension-splitting  recursive  algorithm  to  obtain  a  communication-avoiding 
recursive  matrix  multiplication  algorithm,  CARMA,  which  is  asymptotically  communication- 
optimal  for  any  matrix  dimensions,  number  of  processors,  and  memory  size.  CARMA  is  a 
simple  algorithm.  However,  because  it  is  optimal  across  the  entire  range  of  inputs,  we  find 
cases  where  it  significantly  outperforms  ScaLAPACK. 

The  basic  idea  of  CARMA  is  that  at  each  recursive  step,  the  largest  of  the  three  dimensions 
is  split  in  half,  yielding  two  subproblems.  Depending  on  the  available  memory,  these 
subproblems  are  solved  by  either  a  BFS  step  or  a  DFS  step.  A  simplified  version  of  its 
pseudocode  appears  as  Algorithm  3  (for  more  details,  see  Algorithm  4). 

The  results  in  this  chapter  are  joint  work  with  James  Dernmel,  David  Eliahu,  Armando 
Fox,  Shoaib  Kamil,  Oded  Schwartz  and  Orner  Spillinger,  and  appear  in  [31]. 

Notation 

We  consider  the  case  of  computing  C  =  AB  where  A  is  an  rn  x  k  matrix,  B  is  a  k  x  n 
matrix,  and  C  is  an  rn  X  n  matrix.  In  the  next  two  subsections,  it  will  be  convenient  to 
have  an  ordered  notation  for  the  three  dimensions.  Hence  we  define  d\  =  min (m,n,k), 
d-2  =  median(m,  n,  k),  and  ds  =  max(m,  n,  k).  Both  the  lower  bounds  and  the  communication 
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(a)  Split  along  m 


(b)  Split  along  n 


(c)  Split  along  k 


Figure  3.1:  Three  ways  to  split  rectangular  matrix  multiplication  to  get  two  subproblems, 
shown  in  red  and  blue.  In  each  case,  two  of  the  three  matrices  are  split  between  the 
subproblems,  and  one  is  needed  for  both  subproblems.  The  matrix  that  is  needed  for  both 
subproblems  is  shown  as  two  overlapping  copies. 


Algorithm  3  CARMA,  in  brief 

Input:  A  is  an  m  x  k  matrix,  B  is  a  k  x  n  matrix 

Output:  C  =  AB  is  m  x  n 

1:  Split  the  largest  of  m,n,k  in  half,  giving  two  subproblems 
2:  if  Enough  memory  then 

3:  Solve  the  two  problems  recursively  with  a  BFS 

4:  else 

5:  Solve  the  two  problems  recursively  with  a  DFS 
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costs  of  CARMA  depend  only  on  the  values  of  the  three  dimensions,  not  on  their  order. 
Similarly,  let  M3,  M2,  Mi  be  the  three  matrices  in  increasing  size,  so  M3  has  dimensions  di,  d2 , 
M2  has  dimensions  di,  d3,  and  Mi  has  dimensions  d2,d3. 


3.1  Communication  Lower  Bounds 

To  prove  lower  bounds  on  the  communication  costs,  we  will  assume  that  the  initial/final 
data  layout  of  each  matrix  consists  of  one  copy  of  the  matrix  load  balanced  among  the  P 
processors.  Recall  the  arrangement  of  the  mnk  scalar  products  that  must  be  computed 
into  a  rectangular  prism  from  Section  1.4.  At  least  one  processor  will  perform  at  least  pp- 
multiplications.  Consider  such  a  processor,  and  let  V  be  the  set  of  voxels  corresponding  to 
the  multiplications  it  performs.  Let  |V)|  be  the  size  of  the  projection  of  V  onto  matrix  Mx . 
The  Loomis- Whitney  inequality  [49]  gives  a  lower  bound  on  the  product  of  these  projections: 

IUI-N •  N  >  (3.i) 

There  are  three  cases  to  consider,  depending  on  the  aspect  ratio  of  the  matrices  (see  Figure  3.2 
for  graphical  representations  of  the  cases). 


One  large  dimension 


First,  consider  the  case  of  one  very  large  dimension,  so  that 


2 


dz 

da 


>  P. 


We  consider  three  possibilities  depending  on  the  sizes  of  Vj |  and  Ml- 

If  | Vj  >  5^2p 3 ,  then  the  processor  needs  access  to  at  least  this  many  entries  of  M\.  We 
assumed  that  the  input  and  output  data  layouts  are  exactly  load  balanced,  so  at  most  pA  of 
these  entries  are  stored  by  that  processor  in  the  initial/final  data  layout.  As  a  result,  the 
remaining  AA  must  be  communicated  by  that  processor. 

Similarly,  if  Vj  >  5^p3 ,  then  the  processor  needs  access  to  at  least  this  many  entries  of 
M2.  We  assumed  that  the  input  and  output  data  layouts  are  exactly  load  balanced,  so  at 
most  pA  of  these  entries  are  stored  by  that  processor  in  the  initial/final  data  layout.  As  a 
result,  the  remaining  AA  must  be  communicated  by  that  processor. 

If  |  Vj  |  <  and  \V2\  <  pp3,  we  may  substitute  into  Inequality  3.1  to  obtain 

16 

\V3\  >  ^did2. 

Since  M3  is  load  balanced,  only  pA  of  these  entries  can  be  owned  by  the  processor  in  the 
initial/final  data  layout,  so  for  P  >  2,  at  least 

w  >  l~di d2  =  n(did2) 
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mension 


(b)  Two  large  dimensions  (c)  Three  large  dimensions 


Figure  3.2:  Examples  of  the  three  cases  of  aspect  ratios  on  P  =  8  processors.  The  lowest 
communication  cost  is  attained  if  an  algorithm  divides  the  mxnx  k  prism  of  multiplications 
into  8  equal  sub-prisms  that  are  as  close  to  cubical  as  possible.  If  that  division  involves 
dividing  only  one  of  the  dimensions,  we  call  that  case  one  large  dimension.  If  it  involves 
dividing  only  two  of  the  dimensions,  we  call  that  case  two  large  dimensions.  If  all  three 
dimensions  are  divided,  we  call  that  case  three  large  dimensions.  Note  that  in  the  first  two 
cases,  only  one  processor  needs  access  to  any  given  entry  of  the  largest  matrix,  so  with  the 
right  data  layout  an  algorithm  only  needs  to  transfer  the  smaller  matrices. 


words  must  be  communicated.  The  lower  bound  is  the  minimum  of  these  three  possibilities: 

W  =  Ll  (min  | did2,  • 

By  the  assumption  that  2  ^  >  P,  this  simplifies  to 

W  =  n  (did2)  •  (3.2) 

Since  this  lower  bound  depends  only  on  the  size  of  the  smallest  matrix,  it  is  only  attainable  if 
the  two  larger  matrices  are  distributed  such  that  each  processor  owns  corresponding  entries 
of  them. 

The  latency  lower  bound  is  the  trivial  one  that  there  must  be  at  least  one  message: 
L  = 
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Two  large  dimensions 


Next  consider  the  case  that 


d2G?3 

d\ 


>  P  >  2 


C?3 

d'2 


We  consider  two  posibilities  depending  on  the  size  of  |Vi|. 

If  | V\  >  ,  then  the  processor  needs  access  to  at  least  this  many  entries  of  Mi.  We 

assumed  that  the  input  and  output  data  layouts  are  exactly  load  balanced,  so  at  most  of 
these  entries  are  stored  by  that  processor  in  the  initial/final  data  layout.  As  a  result,  the 
remaining  cl^pL  must  be  communicated  by  that  processor. 

If  |  V\  |  <  ^r1,  we  may  substitute  for  \Vi  in  Inequality  3.1  to  obtain 


N  •  |F3|  > 


2dld2d3 

3P 


(3-3) 


It  follows  that 

max{ | V"2 1 ,  |Va|}  >  \j\^ dId^lL, 

The  amount  of  data  a  processor  stores  of  M2  or  M3  in  the  initial/final  layout  is  at  most 
(recall  that  d\  <  d2  <  d3),  thus  we  obtain  a  bandwidth  lower  bound  of 


W  > 


d\d3 

~P~' 


By  the  assumption  that  P  >  2^,  the  first  term  dominates  and  this  simplifies  to 


W  > 


1  \  /  d\  d2d3 

71 J  V- P~ 


nU 


The  lower  bound  is  the  minimum  of  these  two  possibilities: 

W  =  Lt(  min 


d2  d3  d2d3 


P 


P 


By  the  assumption  that  >  P,  this  simplifies  to 

w  =  (3.4) 

Note  that  this  lower  bound  is  only  attainable  if  the  largest  matrix  is  distributed  among  the 
processors  in  such  a  way  that  it  doesn’t  need  to  be  communicated. 

The  latency  lower  bound  is  the  trivial  one  that  there  must  be  at  least  one  message: 
L  =  fi(l). 
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Three  large  dimensions 


If 


then  by  Theorem  3.1  of  [45],  we  have 


P  >  2 


d2d 


2«3 


d\ 


5 


In  the  case  that 


this  bound  can  be  simplified  to 


w  >  -  M. 

“  2 y/2Py/M 

did2d3\2/3 


M  < 


4  P 


w  =  n(  dpE] . 

\pVmJ 

Additionally,  following  the  methods  of  [5],  Inequality  3.1  implies  that 

d,d2d3  \  ^ 


(3.5) 


max{ | Vi | ,  |V2|,  |V3|}  > 


P 


By  the  assumption  of  load  balance,  the  amount  of  data  available  to  one  processor  in  the 
initial/final  layout  of  any  of  the  matrices  is  at  most  (recall  that  d\  <  d2  <  d3),  so  this 
corresponds  to  a  bandwidth  cost  of  at  least 


W  > 


d\d2d3\  ^  d2d3 


P 


P 


By  the  assumption  that  P  >  2^^,  the  hrst  term  dominates,  and 

d3d2d3\  ^ 


w  =  n 


p 


(3.6) 


Equation  3.5  only  applies  for  M  <  (dlf'pd3)~^3 ,  but  for  larger  M,  the  bound  in  Equation  3.6 
dominates  it.  Thus  the  lower  bound  may  be  concisely  expressed  as  their  sum: 


W 


d\d2d3  f  d\d2d3  \  ^  \ 

~p7Wi  +  \1^)  )' 


(3.7) 


This  bound  is  attainable  for  any  load  balanced  data  layout,  since  it  is  larger  than  the  cost  of 
words  to  redistribute  the  data. 
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p  <  da 

d*2 

“1  large  dimension” 

^3  <  <  ^2^3 

d2  df 

“2  large  dimensions” 

^2^3  p 

“3  large  dimensions” 

Lower  Bound  [45,  31] 

d\d2 

/  dfd2d3 

V  p 

did2d3  i  i 
Ps/M  ^  1 

(  did2d3  ) 
{  P  ) 

2/3 

i 

2D  SUMMA  [2,  36] 

/  dfd2d3 

V  p 

/ d({d2d3 

V  p 

/ d\d2d3 

V  P 

3D  SUMMA  [57] 

/ dfd2d3 

V  P 

/ dfd2d3 

V  P 

d\d2d3  i  i 
Ps/M  +  1 

(  d\d2d3 

l  P  ) 

2/3 

1 

CARMA  [31] 

d\d-2 

/  d\d2d3 

V  P 

did2d3  ,  | 
Ps/M  ^  1 

(  did2d3  \ 

l  P  ) 

2/3 

1 

Table  3.1:  Asymptotic  bandwidth  costs  and  lower  bounds  for  rectangular  matrix  multiplication 
on  P  processors,  each  with  local  memory  size  M,  where  the  matrix  dimensions  are  d\  <  (R  < 
d3.  2D  SUMMA  and  3D  SUMMA  have  the  variant  and  processor  grid  chosen  to  minimize 
the  bandwidth  cost. 


The  latency  lower  bound  is  the  combination  of  two  trivial  bounds:  at  least  one  message 
must  be  sent,  and  the  maximum  message  size  is  M.  Therefore  the  number  of  messages  is: 


L 


n 


(  did^ds 
\PM 3/2  + 


Note  that  whenever  P  =  0  Equations  3.7  and  3.4  give  the  same  bound.  Similarly, 

whenever  P  =  0  Equations  3.4  and  3.2  give  the  same  bound.  We  may  thus  drop  the 

factors  of  2  in  the  definitions  of  one,  two,  and  three  large  dimensions. 


3.2  CARMA  Algorithm 

Detailed  pseudocode  for  CARMA  is  shown  in  Algorithm  4  on  page  46.  The  recursion  cuts 
the  largest  dimension  in  half  to  give  two  smaller  subproblems.  At  each  level  of  the  recursion 
in  CARMA,  a  decision  is  made  between  making  a  depth-first  step  (DFS)  or  a  breadth-first 
step  (BFS)  to  solve  the  two  subproblems.  A  BFS  step  consists  of  two  disjoint  subsets  of 
processors  working  independently  on  the  two  subproblems  in  parallel.  In  contrast,  a  DFS  step 
consists  of  all  processors  in  the  current  subset  working  on  each  subproblem  in  sequence.  A 
BFS  step  increases  memory  usage  by  a  constant  factor,  but  decreases  future  communication 
costs.  On  the  other  hand,  a  DFS  step  decreases  future  memory  usage  by  a  constant  factor, 
but  increases  future  communication  costs.  On  P  processors,  with  unlimited  memory,  we  show 
that  the  algorithm  only  needs  BFS  steps  and  is  communication-optimal. 

If  the  execution  of  only  BFS  steps  causes  the  memory  requirements  to  surpass  the  bounds 
of  available  memory,  it  is  necessary  to  interleave  DFS  steps  within  the  BFS  steps  to  limit 
memory  usage.  We  show  that  the  resulting  algorithm  is  still  communication-optimal,  provided 
the  minimal  number  of  DFS  steps  is  taken.  As  we  show  in  the  analysis  in  Section  3.2.1,  at 
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most  a  factor  of  3x  extra  memory  is  needed  in  the  cases  of  one  or  two  large  dimensions, 
so  DFS  steps  are  not  necessary  in  these  cases.  When  all  dimensions  are  large,  the  extra 
memory  requirement  may  grow  asymptotically.  Therefore  the  memory  may  become  a  limiting 
resource  in  that  case  and  DFS  steps  may  be  necessary.  In  our  experiments,  we  used  relatively 
small  matrices  on  many  processors  so  that  the  problem  is  communication-bound  rather  than 
computation-bound.  As  a  result,  the  matrices  are  small  enough  that  memory  was  not  a 
limiting  factor,  and  only  BFS  steps  were  used. 

See  Section  3.4.3  and  Figure  3.4  for  a  description  of  the  data  layout  in  the  distributed- 
memory  case. 


3.2.1  Communication  cost  of  CARMA 

CARMA  will  perform  a  total  of  log2  P  BFS  steps,  possibly  with  some  DFS  steps  interleaved. 
There  are  again  three  cases  to  consider.  In  each  case,  CARMA  attains  the  bandwidth  lower 
bound  up  to  a  constant  factor,  and  the  latency  lower  bound  up  to  a  factor  of  at  most  log  P. 
In  the  previous  subsection,  we  defined  one  large  dimension,  two  large  dimensions,  and  three 
large  dimensions  asymptotically.  The  lower  bounds  are  “continuous”  in  the  sense  that  they 
are  equivalent  for  parameters  within  a  constant  factor  of  the  threshold,  so  the  precise  choice  of 
the  cutoff  does  not  matter.  In  this  section,  we  define  them  precisely  by  CARMA’s  behavior. 


One  large  dimension 


If  P  <  then  there  are  no  DFS  steps,  only  one  dimension  is  ever  split,  and  the  smallest 
matrix  is  replicated  at  each  BFS  step.  The  communication  cost  is  the  cost  to  send  this  matrix 
at  each  step: 


O  (di^) , 


since  d\d<ijP  is  the  initial  amount  of  data  of  the  smallest  matrix  per  processor,  and  it  increases 
by  a  factor  of  2  at  each  BFS  step.  In  this  case  the  BFS  steps  can  be  thought  of  as  performing 
an  all-gather  or  reduce-scatter  on  the  smallest  matrix.  By  having  the  initial/final  data 
layout  be  distributed  across  the  processors,  the  BFS  approach  avoids  the  logP  factor  in 
the  bandwidth  that  a  broadcast  or  reduce  would  incur.  When  d\d-2  <  P ,  there  is  still  a 
logarithmic  factor,  and  the  cost  is  W  =  O  (dida  log  ■ 

The  memory  use  is  the  memory  required  to  hold  the  input  and  output,  plus  the  memory 
required  to  hold  all  the  data  received,  so 


M  =  0 


d\d2  +  did 3  +  dads 
_ 


+  d\d'2  )  —  O 


dads 

~~P~ 


At  most  a  constant  factor  of  extra  memory  is  required.  In  fact  the  constant  factor  is  quite 
small:  since  d\d2  <  d\ds/P  <  dads/ P,  it  uses  at  most  a  factor  of  3/2  as  much  memory  as  is 
required  to  store  the  input  and  output. 
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Algorithm  4  CARMA (A,B  ,C  ,m,k,n,P) 


Input:  A  is  an  m  x  k  matrix  and  B  is  a  k  x  n  matrix 

Output:  C  =  AB 

1:  if  P  =  1  then 

2:  SequentialMultiply(  A,  B ,  C,  m,  k,  n  ) 

3:  else  if  Enough  Memory  then  >  Do  a  BFS 

4:  if  n  is  the  largest  dimension  then 

5:  Copy  A  to  disjoint  halves  of  the  processors. 

>  Processor  i  sends  and  receives  local  A  from  processor  i  ±  P/2 


6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 


parallel  for  do 

CARMA  (A,  Rieft,  Cleft,  m,  k,  n/2,  P/2 ) 

CARMA  (A,  -Bright ,  Cright,  m,  k,  n/2,  P/2) 
if  m  is  the  largest  dimension  then 

Copy  B  to  disjoint  halves  of  the  processors. 

>  Processor  i  sends  and  receives  local  B  from  processor  %  ±  P/2 

parallel  for  do 

CARMA(AtoP,  B,  Ctop,  m/2,  k,  n,  P/2) 

CARMA (Abot,  B,  Cbot,  m/2,  k,  n,  P/2) 
if  k  is  the  largest  dimension  then 
parallel  for  do 

CARMA (Aleft,  Btop,C,  m,  k/2,  n,  P/2) 

CARMA (Aright,  Bbot,C,  m,  k/2,  n,  P/2) 

Gather  C  from  disjoint  halves  of  the  processors. 

>  Processor  i  sends  C  and  receives  C"  from  processor  i  ±  P/2 

c  c  + a 

else  >  Do  a  DFS 

if  n  is  the  largest  dimension  then 

CARMA  (A,  Bieft,  Ceft,  m,  k,  n/2,  P) 

CARMA  (A,  Bright,  Cright,  m,  k,  n/2,  P) 
if  m  is  the  largest  dimension  then 

CARMA(AtoP,  B,  Ctop,  m/2,  k,  n,  P) 

CARMA (Abot,  B,  Cbot,  m/2,  k,  n,  P) 

if  k  is  the  largest  dimension  then 

CARMA (Aieft,  Btop,  C,  m,  k/2,  n,  P) 

CARMA  (Aright,  Bbot,  C,  m,  k/2,  n,  P) 
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The  number  of  messages  sent  at  each  BFS  is  constant,  so  the  latency  cost  is  L  =  O(logP). 


Two  large  dimensions 


Next  consider  the  case  that 


d:i 

d2 


<  P  < 


"dp 


There  will  be  two  phases:  for  the  first  log2  ^  BFS  steps,  the  original  largest  dimension  is  split; 
then  for  the  remaining  log2  BFS  steps,  the  two  original  largest  dimensions  are  alternately 
split.  Again,  no  DFS  steps  are  required.  The  bandwidth  cost  of  the  first  phase  is 


Wi  =  O 


/log2|-i  \ 
«i«2  2 


V 


i= 0 


P 


=  o 


) 


d\ 


The  bandwidth  cost  of  the  second  phase  is 


/  §  loS2 


W2  =  O 


X 

i= 0 


d\d2 

P  d2/ d'3 


\ 


=  O 


dr j  d2d.3 

P 


since  every  two  BFS  steps  increases  the  amount  of  data  being  transferred  by  a  factor  of  2. 

The  cost  of  the  second  phase  dominates  the  cost  of  the  first.  Again,  the  memory  use  is 
the  memory  required  to  hold  the  input  and  output,  plus  the  memory  required  to  hold  all  the 
data  received,  so 


M  =  O 


d\d2  +  d\d^  +  d2d3 
P 


+ 


d^  d2  c?3 

P 


=  O 


d2d3 

~P~ 


At  most  a  constant  factor  of  extra  memory  is  required,  justifying  our  use  of  BFS  only.  In 
fact,  by  the  assumptions  of  this  case,  the  algorithm  never  uses  more  than  3  times  the  amount 
of  memory  used  to  store  the  input  and  output. 

There  are  a  constant  number  of  messages  sent  at  each  BFS  step,  so  the  latency  cost  is 

L  =  0(\ogP). 


Three  large  dimensions 

Finally,  consider  the  case  that  P  >  dMs..  The  first  phase  consists  of  log2  ^  BFS  steps  splitting 
the  largest  dimension,  and  is  exactly  as  in  the  previous  case.  The  second  phase  consists 
of  2  log2  ^  BFS  steps  alternately  splitting  the  two  original  largest  dimensions.  After  this 

Pd? 

phase,  there  are  P3  =  processors  working  on  each  subproblem,  and  the  subproblems  are 
multiplication  where  all  three  dimensions  are  within  a  factor  of  2  of  each  other.  CARMA 
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splits  each  of  the  dimensions  once  every  three  steps,  alternating  BFS  and  DFS  to  stay  within 
memory  bounds,  until  it  gets  down  to  one  processor. 

The  cost  of  the  first  phase  is  exactly  as  in  the  previous  case.  The  bandwidth  cost  of  the 
second  phase  is 
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In  the  final  phase,  the  cost  is  within  a  factor  of  4  of  the  square  case,  which  was  discussed 
in  Section  2.9,  giving 
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while  remaining  within  memory  size  M .  W3  is  the  dominant  term  in  the  bandwidth  cost. 

The  latency  cost  of  the  first  two  phases  is  log  ,  and  the  latency  cost  of  the  third  phase 
is 
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giving  a  total  latency  cost  of 
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3.3  Performance  Results 

We  have  implemented  CARMA  in  C++  with  MPI.  We  benchmark  on  Hopper1,  a  Cray  XE6 
at  the  National  Energy  Research  Scientific  Computing  Center  (NERSC).  It  consists  of  6,384 
compute  nodes,  each  of  which  has  2  twelve-core  AMD  “Magny-Cours”  2.1  GHz  processors 
and  32  GB  of  DRAM  (384  of  the  nodes  have  64  GB  of  DRAM).  The  24  cores  are  divided 
between  4  NUMA  regions. 

CARMA  gets  the  best  performance  when  run  as  “flat  MPI” ,  with  one  MPI  process  per 
core.  Local  sequential  matrix  multiplications  are  performed  by  calls  to  Cray  LibSci  version 
11.1.00.  The  distributed- memory  version  of  CARMA  supports  splitting  by  arbitrary  factors 
at  each  recursive  step  rather  than  just  by  a  factor  of  2.  For  each  data  point,  several  splitting 
factors  and  orders  were  explored  and  the  one  with  the  best  performance  is  shown.  It  is 
possible  that  further  performance  improvements  are  possible  by  exploring  the  search  space 


Wor  machine  details,  see  http://www.nersc.gov/users/computational-systems/hopper 
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(b)  Time  breakdown,  192  x  6291456  x  192. 
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(d)  Time  breakdown,  12288  x  192  x  12288. 


(e)  Strong  scaling,  square  n  =  6144.  (f)  Time  breakdown,  square  n  =  6144. 

ScaLAPACK  -•“•-CARMA  - Floating  Point  Peak 

Figure  3.3:  CARMA  compared  to  ScaLAPACK  on  Hopper.  Left  column:  Strong  scaling  of 
performance.  Right  column:  CPU-time  breakdown  summed  over  all  cores  (so  perfect  strong 
scaling  would  correspond  to  equal  heights  at  24  and  6144  cores). 
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more  thoroughly.  For  a  description  of  the  data  layout  used  by  distributed  CARMA,  see 
Section  3.4.3. 

We  compare  CARMA  against  ScaLAPACK  version  1.8.0  as  optimized  by  NERSC.  ScaLA- 
PACK  also  uses  LibSci  for  local  multiplications.  For  each  data  point  we  explore  several 
possible  executions  and  show  the  one  with  the  highest  performance.  First,  we  try  running 
with  1,  6,  or  24  cores  per  MPI  process.  Parallelism  between  cores  in  a  single  process  is 
provided  by  LibSci.  Second,  we  explore  all  valid  processor  grids.  We  also  try  storing  the 
input  matrices  as  stated,  or  transposed.  In  some  cases  storing  the  transpose  of  one  of  the 
inputs  increases  ScaLAPACK’s  performance  by  more  than  a  factor  of  10. 

The  topology  of  the  allocation  of  nodes  on  Hopper  is  outside  the  user’s  control,  and,  for 
communication-bound  problems  on  many  nodes,  can  affect  the  runtime  by  as  much  as  a 
factor  of  2.  We  do  not  attempt  to  measure  this  effect.  Instead,  for  every  data  point  shown, 
the  CARMA  and  ScaLAPACK  runs  were  performed  during  the  same  reservation  and  hence 
using  the  same  allocation. 

We  benchmark  three  shapes  of  matrices  corresponding  to  the  three  cases  in  the  communi¬ 
cation  costs  in  Table  3.1.  For  the  case  of  one  large  dimension,  we  benchmark  m  —  n  —  192, 
k  =  6291456.  The  aspect  ratio  is  very  large  so  it  is  in  the  one  large  dimension  case 
{k/P  >  m,n )  even  for  our  largest  run  on  P  =  24576  cores.  In  this  case  we  see  improvements 
of  up  to  140  x  over  ScaLAPACK.  This  data  is  shown  in  Figure  3.3a.  If  ScaLAPACK  is  not 
allowed  to  transpose  the  input  matrices,  the  improvement  grows  to  2500  x. 

For  the  case  of  two  large  dimensions,  we  benchmark  m  —  n  —  24576,  k  =  192.  In  this 
case  both  CARMA  and  ScaLAPACK  (which  uses  SUMMA)  are  communication-optimal,  so 
we  do  not  expect  a  large  performance  difference.  Indeed  performance  is  close  between  the 
two  except  on  very  large  numbers  of  processors  (the  right  end  of  Figure  3.3c)  where  CARMA 
is  nearly  2x  faster. 

Finally  for  the  case  of  three  large  dimensions,  we  benchmark  m  =  n  =  k  =  6144.  For  small 
numbers  of  processors,  the  problem  is  compute-bound  and  both  CARMA  and  ScaLAPACK 
perform  comparably.  For  more  than  about  1000  cores,  CARMA  is  faster,  and  on  24576  cores 
it  is  nearly  3x  faster.  See  Figure  3.3e. 

The  right-hand  column  of  Figure  3.3  shows  the  breakdown  of  time  between  computation 
and  communication  for  CARMA  and  ScaLAPACK,  for  each  of  these  matrix  sizes,  and  for  24 
cores  (1  node)  and  6144  cores  (256  nodes).  In  the  case  of  1  large  dimension  on  6144  cores, 
CARMA  is  16  x  faster  at  the  computation,  but  more  than  1000  x  faster  at  the  communication. 
CARMA  is  faster  at  the  computation  because  the  local  matrix  multiplications  are  as  close  to 
square  as  possible  allowing  for  more  efficient  use  of  the  cache.  For  the  other  two  sizes,  the 
computation  time  is  comparable  between  the  two,  but  CARMA  spends  about  3.5  x  less  time 
on  communication  on  6144  cores. 

All  tests  are  for  multiplication  of  randomly  generated  double  precision  matrices.  For  each 
algorithm  and  size,  one  warm-up  run  was  performed  immediately  before  the  benchmark. 
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3.4  Remarks 

CARMA  is  the  first  distributed-memory  parallel  matrix  multiplication  algorithm  to  be 
communication-optimal  for  all  dimensions  of  matrices  and  sizes  of  memory.  We  prove 
CARMA’s  communication  optimality  and  compare  it  against  ScaLAPACK.  Despite  its  simple 
implementation,  the  algorithm  minimizes  communication,  yielding  performance  improvements 
of  2x  to  140  x.  As  expected,  our  best  improvement  comes  in  ranges  where  CARMA  achieves 
lower  bounds  on  communication  but  previous  algorithms  do  not. 

3.4.1  Opportunities  for  Tuning 

The  algorithm  described  in  Section  3.2  always  splits  the  largest  dimension  by  a  factor  of  2. 
This  can  be  generalized  considerably.  At  each  recursive  step,  the  largest  dimension  could  be 
split  by  any  integer  factor  s,  which  could  vary  between  steps.  Increasing  s  from  2  decreases 
the  bandwidth  cost  (by  at  most  a  small  constant  factor)  while  increasing  the  latency  cost. 
The  choice  of  split  factors  is  also  affected  by  the  number  of  processors,  since  the  product  of 
all  split  factors  at  BFS  steps  must  equal  the  number  of  processors.  Additionally,  when  two 
dimensions  are  of  similar  size,  either  one  could  be  split.  As  long  as  the  s  are  bounded  by 
a  constant,  and  the  dimension  that  is  split  at  each  step  is  within  a  constant  factor  of  the 
largest  dimension,  a  similar  analysis  to  the  one  in  Section  3.2.1  shows  that  CARMA  is  still 
asymptotically  communication-optimal.  Note  that  this  means  that  CARMA  can  efficiently 
use  any  number  of  processors  that  does  not  have  large  prime  factors,  by  choosing  split  factors 
s  that  factor  the  number  of  processors. 

In  practice,  however,  there  is  a  large  tuning  space,  and  more  performance  improvements 
may  be  possible  by  exploring  this  space  further.  Our  implementation  allows  the  user  to 
choose  any  dimension  to  split  and  any  split  factor  at  each  recursive  step  (but  the  required 
data  layout  will  vary;  see  Section  3.4.3).  On  Hopper,  we  have  found  that  splitting  6  or  8 
ways  at  each  step  typically  performs  better  than  splitting  2  ways,  but  we  have  not  performed 
an  exhaustive  search  of  the  tuning  space. 


3.4.2  Perfect  Strong  Scaling  Range 


Recall  the  definition  of  perfect  strong  scaling  from  Section  1.1.  In  the  square  case,  the  2.5D 
algorithm  and  the  square  BFS/DFS  algorithm  exhibit  perfect  strong  scaling  in  the  range 
P  =  Q(n2/M)  and  P  =  0(n3/M3//2),  which  is  the  maximum  possible  range.  Similarly,  in  the 
case  of  three  large  dimensions,  defined  by 


P 


~W’)’ 


both  CARMA  and  3D-SUMMA  exhibit  perfect  strong  scaling  in  the  maximum  possible  range 
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Note  that  in  the  plots  shown  in  this  paper,  the  entire  problem  fits  on  one  node,  so  the  range 
degenerates  to  just  P  —  1. 

In  the  case  of  one  or  two  large  dimensions,  the  bandwidth  lower  bound  does  not  decrease 
linearly  with  P  (see  Table  3.1).  As  a  result,  perfect  strong  scaling  is  not  possible.  Figure  3.3a 
shows  very  good  strong  scaling  for  CARMA  in  practice  because,  even  though  the  bandwidth 
cost  does  not  decrease  with  P  in  this  case,  it  is  small  enough  that  it  is  not  dominant  up  to 
6144  cores  (see  Figure  3.3b). 

3.4.3  Data  Layout  Requirements 

Recall  the  three  cases  of  the  bandwidth  cost  lower  bounds  from  Section  3.1.  In  the  case  of 
three  large  dimensions,  the  lower  bound  is  higher  than  the  size  of  the  input  and  output  data 
per  processor:  rnn+n^+mk  _  This  means  it  is  possible  to  attain  the  bandwidth  lower  bound 
with  any  load  balanced  initial/final  data  layout,  since  the  bandwidth  cost  of  redistributing 
the  data  is  sub-dominant. 

However,  in  the  case  of  one  or  two  large  dimensions,  the  bandwidth  cost  lower  bound 
is  lower  than  the  size  of  the  input  and  output  data  per  processor.  This  means  that  a 
communication-optimal  algorithm  cannot  afford  to  redistribute  the  largest  matrix,  which 
limits  the  data  layouts  that  can  be  used.  For  example,  in  the  case  of  one  large  dimension, 
where  CARMA  shows  its  greatest  advantage,  it  is  critical  that  only  entries  of  the  smallest- 
matrix  ever  be  communicated.  As  a  result,  it  is  necessary  for  corresponding  entries  of  the 
two  larger  matrices  to  be  on  the  same  processor  in  the  initial/final  data  layout. 

CARMA  only  communicates  one  of  the  three  matrices  at  each  BFS  step.  It  requires  that 
each  of  the  two  halves  of  the  other  two  matrices  already  resides  entirely  on  the  corresponding 
half  of  the  processors.  See  Figure  3.4.  This  requirement  applies  recursively  down  to  some 
block  size,  at  which  point  CARMA  uses  a  cyclic  data  layout  (any  load  balanced  layout  would 
work  for  the  base  case).  The  recursive  data  layout  that  the  distributed  version  of  CARMA 
uses  is  different  from  any  existing  linear  algebra  library;  hence  CARMA  cannot  be  directly 
incorporated  into,  for  example,  ScaLAPACK  or  Elemental.  Changing  the  data  layout  before 
or  after  calling  CARMA  would  defeat  its  advantage,  which  comes  from  not  communicating 
the  largest  matrices  at  all. 

In  fact,  even  if  a  new  library  is  designed  for  CARMA,  there  is  a  complication.  If  a  matrix 
is  used  multiple  times  in  a  computation,  sometimes  as  the  largest  and  sometimes  not  the 
largest,  the  data  layouts  CARMA  prefers  will  not  be  consistent.  It  should  still  be  possible  to 
asymptotically  attain  the  communication  lower  bound  for  any  sequence  of  multiplications 
by  choosing  the  correct  initial  or  final  layout  and  possibly  transforming  the  layout  between 
certain  multiplications.  Doing  so  in  a  way  that  makes  the  library  easy  to  use  while  remaining 
efficient  is  left  as  an  open  problem. 
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Figure  3.4:  Data  layout  for  a  BFS  step  splitting  dimension  k.  Before  the  BFS  step,  all  three 
matrices  are  distributed  on  P  processors.  The  distributed  code  assumes  that  ^4ieft  and  Btop 
are  distributed  among  the  first  P/2  processors,  height  and  B^ot  are  distributed  among  the 
remaining  P/2  processors,  and  C  is  distributed  among  all  the  processors.  The  layout  applies 
recursively,  following  the  execution  pattern,  and  in  the  base  case  the  layout  is  cyclic. 
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Chapter  4 

Sparse  Matrix  Multiplication 


A  sparse  matrix  is  one  for  which  o(n2)  entries  are  nonzero,  and  only  the  nonzero  entries 
are  explicitly  stored.  This  means  that,  when  applying  the  classical  matrix  multiplication 
algorithm,  among  the  n3  multiplications  that  are  performed  in  the  dense  case,  only  those 
for  which  the  corresponding  entries  of  A  and  B  are  both  nonzero  must  be  performed.  Here 
we  only  consider  algorithms  where  every  nonzero  product  is  computed  directly,  and  not  fast 
sparse  algorithms  such  as  the  one  of  Yuster  and  Zwick  [67] . 

The  results  in  this  chapter  are  joint  work  with  Grey  Ballard,  Aydm  Bulug,  James  Demmel, 
Laura  Grigori,  Oded  Schwartz  and  Sivan  Toledo;  the  algorithms  and  analysis  appear  in  [4], 

Achieving  scalability  for  parallel  algorithms  for  sparse  matrix  problems  is  challenging 
because  the  computations  tend  not  to  have  the  potential  for  @(\/T7)  data  re-use  that  is 
common  in  dense  matrix  problems.  Further,  the  performance  of  sparse  algorithms  is  often 
highly  dependent  on  the  sparsity  structure  of  the  input  matrices.  We  show  in  this  chapter 
that  previous  algorithms  for  sparse  matrix-matrix  multiplication  are  non  optimal  in  their 
communication  costs,  and  we  obtain  new  algorithms  which  are  communication  optimal, 
communicating  less  than  the  previous  algorithms  and  matching  new  lower  bounds. 

Our  lower  bounds  require  two  important  assumptions:  (1)  the  sparsity  of  the  input- 
matrices  is  random,  corresponding  to  Erdos- Renyi  random  graphs  (see  Definition  4.1),  and  (2) 
the  algorithm  is  sparsity-independent ,  where  the  partitioning  of  the  computation  among  the 
processors  is  independent  of  the  sparsity  structure  of  the  input  matrices  (see  Definition  4.4). 
The  second  assumption  applies  to  nearly  all  existing  algorithms  for  general  sparse  matrix- 
matrix  multiplication.  While  a  priori  knowledge  of  sparsity  structure  can  certainly  reduce 
communication  for  many  important  classes  of  inputs,  dynamically  determining  and  efficiently 
exploiting  the  structure  of  general  input  matrices  is  a  challenging  problem.  In  fact,  a  common 
technique  of  current  library  implementations  is  to  randomly  permute  rows  and  columns  of 
the  input  matrices  in  an  attempt  to  destroy  their  structure  and  improve  computational  load 
balance  [17,  19].  Because  the  input  matrices  are  random,  our  analyses  are  in  terms  of  expected 
communication  costs. 
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4.1  Preliminaries 

For  sparse  matrix  indexing,  we  use  the  colon  notation,  where  A(:,i)  denotes  the  ith  column, 
A(i, :)  denotes  the  ?’tli  row,  and  A(i,j)  denotes  the  element  at  the  (i,  j)th  position  of  matrix 

A.  We  use  nnz(-)  to  denote  the  number  of  nonzeros  in  a  matrix  or  submatrix. 

We  consider  the  case  where  A  and  B  are  n  x  n  ER(rf)  matrices: 

Definition  4.1.  An  ER(d)  matrix  is  an  adjacency  matrix  of  an  Erdos-Renyi  graph  with 
parameters  n  and  d/n.  That  is,  an  ER(d)  matrix  is  a  square  matrix  of  dimension  n  where 
each  entry  is  nonzero  with  probability  d/n.  We  assume  d  <C  yfn. 

It  is  not  important  for  our  analysis  to  which  semiring  the  matrix  entries  belong,  though  we 
assume  algorithms  do  not  short-circuit  summations  or  exploit  cancellation  in  the  intermediate 
values  or  output  entries.  In  this  case,  the  following  facts  will  be  useful  for  our  analysis. 

Fact  4.2.  Let  A  and  B  be  n  x  n  ER(d)  matrices.  Then 

(a)  the  expected  number  of  nonzeros  in  A  and  in  B  is  dn, 

(b)  the  expected  number  of  scalar  multiplications  in  A  -  B  is  d2n,  and 

(c)  the  expected  number  of  nonzeros  in  C  is  d2n(  1  —  o(l)). 

Proof.  Since  each  entry  of  A  and  B  is  nonzero  with  probability  d/n ,  the  expected  number  of 
nonzeros  in  each  matrix  is  n2(d/n )  =  dn.  For  each  of  the  possible  n3  scalar  multiplications 
in  A  ■  B,  the  computation  is  required  only  if  both  corresponding  entries  of  A  and  B  are 
nonzero,  which  are  independent  events.  Thus  the  probability  that  any  multiplication  is 
required  is  d2/n2,  and  the  expected  number  of  scalar  multiplications  is  d2n.  Finally,  an  entry 
of  C  =  A  ■  B  is  zero  only  if  all  n  possible  scalar  multiplications  corresponding  to  it  are  zero. 
Since  the  probability  that  a  possible  scalar  multiplication  is  zero  is  (1  —  d2/n2)  and  the  n 
possible  scalar  multiplications  corresponding  to  a  single  output  entry  are  independent,  the 
probability  that  an  entry  of  C  is  zero  is  (1  —  d2/n2)n  =  1  —  d2  jn  +  0(d4/n2).  Thus  the 
expected  number  of  nonzeros  of  C  is  n2(d2/n  —  0(d4/n 2)  =  d2n(  1  —  o(l)),  since  we  assume 
d  <  \fn.  □ 

Recall  from  Section  1.4  that  the  n3  scalar  multiplications  in  dense  matrix  multiplication 
can  be  arranged  into  a  cube  V  whose  faces  represent  the  input  matrices. 

Definition  4.3.  We  say  a  voxel  (i,j,k)  e  V  is  nonzero  if,  for  given  input  matrices  A  and 

B,  both  A(i,k)  and  B(k,j)  are  nonzero. 

Definition  4.4.  A  sparsity-independent  parallel  algorithm  for  sparse  matrix-matrix  multipli¬ 
cation  is  one  in  which  the  assignment  of  entries  of  the  input  and  output  matrices  to  processors 
and  the  assignment  of  computation  voxels  to  processors  is  independent  of  the  sparsity  pattern 
of  the  input  (or  output)  matrices.  If  an  assigned  matrix  entry  is  zero,  the  processor  need 
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not  store  it;  if  an  assigned  voxel  is  zero,  the  processor  need  not  perform  the  computation 
corresponding  to  that  voxel. 

Our  lower  bound  argument  in  Section  4.2  will  apply  to  all  sparsity-independent  algorithms. 
However,  we  will  analyze  a  more  restricted  class  of  algorithms  in  Section  4.3,  those  that 
assign  contiguous  brick-shaped  sets  of  voxels  to  each  processor. 


4.1.1  All-to-all  Communication 

Several  of  the  algorithms  we  discuss  make  use  of  all-to-all  communication.  If  each  processor 
needs  to  send  b  different  words  to  each  other  processor  (so  each  processor  needs  to  send  a 
total  of  b(P  —  1)  words),  the  bandwidth  lower  bound  is  W  —  Ll(bP)  and  the  latency  lower 
bound  is  S  —  14  (log  P).  Each  of  these  bounds  is  attainable,  but  it  has  been  shown  that  they 
are  not  simultaneously  attainable  (see  Theorem  2.9  of  [15]).  Depending  on  the  relative  costs 
of  bandwidth  and  latency,  one  may  wish  to  use  the  point-to-point  algorithm  (each  processor 
sends  data  directly  to  each  other  processor)  which  incurs  costs  of  W  =  0{bP ),  S  =  0(P )  or 
the  bit-fixing  algorithm  (each  message  of  b  words  is  sent  by  the  bit-fixing  routing  algorithm) 
which  incurs  costs  of  W  =  0(bP  log  P),  S  =  O(logP).  Both  of  these  are  optimal,  in  the 
sense  that  neither  the  bandwidth  cost  nor  the  latency  cost  can  be  asymptotically  improved 
without  asymptotically  increasing  the  other  one. 

4.2  Communication  Lower  Bounds 

The  general  lower  bounds  for  direct  linear  algebra  [10]  apply  to  our  case  and  give 

w  =  n{p7^)'  (4-1) 

This  bound  is  highest  when  M  takes  its  minimum  value  d2n/P,  in  which  case  it  becomes 
W  =  Ll(^/d2n/ P).  In  this  section  we  improve  (increase)  these  lower  bounds  by  a  factor  of 
\fn  ■  min{l,  d/\[P}.  For  larger  values  of  M,  the  lower  bound  in  Equation  4.1  becomes  weaker, 
whereas  our  new  bound  does  not,  and  the  improvement  factor  increases  to  \/M -min{l,  y/P/d}. 
The  previous  memory-independent  lower  bound  [5]  reduces  to  the  trivial  bound  W  =  14(0). 
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Theorem  4.5.  Any  sparsity-independent  sparse  matrix  multiplication  algorithm  with  load- 
balanced  input  and  output  has  an  expected  communication  cost  lower  bound  of 


W  =  Ul  \  min 


dri  d2n  1 


when  applied  to  ER(d)  input  matrices  on  P  processors. 
Note  that  this  bound  can  also  be  written  as 


w  =  n 


dn 


nun 


d 


'VP 


and  so  which  bound  applies  depends  on  the  ratio  d/\fP. 

Proof.  Consider  the  n3  voxels  that  correspond  to  potential  scalar  multiplications  A(i,  k )  • 
B(k,j).  A  sparsity-independent  algorithm  gives  a  partitioning  of  these  multiplications  among 

3 

the  P  processors.  Let  V  be  the  largest  set  of  voxels  assigned  to  a  processor,  so  \V\  >  For 
each  i,j,  let  1^.  be  the  number  of  values  of  k  such  that  (■ i,j ,  k )  e  V,  see  Figure  4.2.  We  count 
how  many  of  the  voxels  in  V  correspond  to  Cf.  <  |  and  divide  into  two  cases. 


Figure  4.2:  Graphical  representation  of  V  and 


Case  1:  At  least  ^  voxels  of  V  correspond  to  ^  Let  V'  be  these  voxels,  so  \  V'\  > 

We  will  analyze  the  communication  cost  corresponding  to  the  computation  of  V'  and  get  a 

bound  on  the  number  of  products  computed  by  this  processor  that  must  be  sent  to  other 

processors.  Since  the  output  is  load  balanced  and  the  algorithm  is  sparsity-independent,  the 

2 

processor  that  computes  V  is  allowed  to  store  only  a  particular  set  of  A;  entries  of  C  in 
the  output  data  layout.  Since  every  voxel  in  V'  corresponds  to  an  ifj  <  j,  the  ^  output 
elements  stored  by  the  processor  correspond  to  at  most  fp  voxels  in  V',  which  is  at  most 
half  of  \V'\.  All  of  the  nonzero  voxels  in  the  remainder  of  V'  contribute  to  entries  of  C  that 
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must  be  sent  to  another  processor.  In  expectation,  this  is  at  least  ^  nonzero  voxels,  since 
each  voxel  is  nonzero  with  probability  Moreover,  from  Lemma  4.2,  only  a  small  number 
of  the  nonzero  entries  of  C  have  contributions  from  more  than  one  voxel,  so  very  few  of  the 
values  can  be  summed  before  being  communicated.  The  expected  bandwidth  cost  is  then 
bounded  by  W  =  Q(d2n/P). 

Case  2:  Fewer  than  voxels  of  V  correspond  to  £/  <  j.  This  means  that  at  least  jp 
voxels  of  V  correspond  to  £fj  >  Let  V"  be  these  voxels,  so  \V"\  >  g.  We  will  analyze 
the  communication  cost  corresponding  to  the  computation  of  V"  and  get  a  lower  bound  on 
the  amount  of  input  data  needed  by  this  processor.  For  each  i,  k,  let  £fk  be  the  number  of 
values  of  j  such  that  (i,  j,  k)  G  V" .  Similarly,  for  each  j,  k ,  let  lp‘k  be  the  number  of  values  of 
%  such  that  (i,  j,  k)  €  V" .  Partition  V"  into  three  sets:  Vo  is  the  set  of  voxels  that  correspond 
to  >  y  and  tPk  >  Va  is  the  set  of  voxels  that  correspond  to  £fk  <  and  Vb  is  the  set 
of  voxels  that  correspond  to  £pk  <  ^  and  £\ |  >  ^.  At  least  one  of  these  sets  has  at  least  ^ 
voxels,  and  we  divide  into  three  subcases. 

Case  2a:  |Vo|  >  fp-  Let.  pa,  Pb,  and  pc  be  the  sizes  of  the  projections  of  Vo  onto  A,  B , 
and  C,  respectively.  Lemma  1.1  implies  that  PaPbPc  >  | Vo | 2  =  ^pp2-  The  assumptions  of 
Case  2  implies  pc  <  j^j.  Thus  PaPb  >  or  ma x{pa,Pb}  >  Jppp-  Since  the  situation 
is  symmetric  with  respect  to  A  and  i?,  assume  without  loss  of  generality  that  A  has  the 
larger  projection,  so  pa  >  -/=.  Since  the  density  of  A  is  this  means  that  the  expected 
number  of  nonzeros  in  the  projection  of  Vo  onto  A  is  at  least  -J= =.  Since  each  of  these 
nonzeros  in  A  corresponds  to  a  &fk  >  T  it  is  needed  to  compute  Vo  with  probability  at  least 
1  —  (1  —  d/n)n^d  >  1  —  1/e.  Thus  in  expectation  a  constant  fraction  of  the  nonzeros  of  A  in 
the  projection  of  Vo  are  needed.  The  number  of  nonzeros  the  processor  holds  in  the  initial 
data  layout  is  ^  in  expectation,  which  is  asymptotically  less  than  the  number  needed  for  the 
computation.  Thus  we  get  a  bandwidth  lower  bound  of  W  =  f l{dn/\fP). 

Case  2b:  \Va\  >  fp-  Each  voxel  in  Va  corresponds  to  £fk  <  In  this  case  we  are  able 
to  bound  the  re-use  of  entries  of  rriA  to  get  a  lower  bound.  Count  how  many  entries  of  A 
correspond  to  each  possible  value  of  lfk,  1  <  r  <  ?,  and  call  this  number  Nr.  Note  that 

Ylr=ir  '  AV  =  \VA\.  Suppose  a  given  entry  A(i,  k )  corresponds  to  £fk  =  r.  We  can  bound  the 
probability  that  A(i,k )  is  needed  by  the  processor  to  compute  Vi  as  a  function  f(r).  The 
probability  that  A(i,  k )  is  needed  is  the  probability  that  both  A(i,  k )  is  nonzero  and  one  of 
the  r  voxels  corresponding  to  A{i,  k)  in  Vi  is  nonzero,  so 


f(r)  = 


d 

n 


> 


rd2 

2^’ 


since  r  <  jj.  Thus  the  expected  number  of  nonzeros  of  A  that  are  needed  by  the  processor  is 


n/d 

r=  1 


d2 
2  n2 


n/d 

r  ■  Nr> 

r— 1 


d2n 

12 P' 
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This  is  asymptotically  larger  than  the  number  of  nonzeros  the  processor  holds  at  the  beginning 
of  the  computation,  so  we  get  a  bandwidth  lower  bound  of  W  =  fl(d2n/P). 

Case  2c:  \Vb\  >  fp.  The  analysis  is  identical  to  the  previous  case,  except  we  look  at  the 
number  of  nonzeros  of  B  that  are  required. 

Since  an  algorithm  may  be  in  any  of  these  cases,  the  overall  lower  bound  is  the  minimum: 


W 


n  ( 


nun 


V 


dn  d2n  1  \ 


□ 


4.3  Algorithms 

In  this  section  we  consider  algorithms  which  assign  contiguous  bricks  of  voxels  to  processors. 
We  categorize  these  algorithms  into  ID,  2D,  and  3D  algorithms,  as  shown  in  Figure  4.1.  If 
we  consider  the  dimensions  of  the  brick  of  voxels  assigned  to  each  processor,  ID  algorithms 
correspond  to  bricks  with  two  dimensions  of  length  n  (and  1  shorter),  2D  algorithms  correspond 
to  bricks  with  one  dimension  of  length  n  (and  2  shorter),  and  3D  algorithms  correspond 
to  bricks  with  all  3  dimensions  shorter  than  n.  Table  4.1  provides  a  summary  of  the 
communication  costs  of  the  sparsity-independent  algorithms  we  consider. 


Algorithm 

Bandwidth  cost 

Latency  cost 

Previous  Lower  Bound  [10] 

d2n 

Ps/M 

0 

Lower  Bound  [4] 

1 

Naive  Block  Row  [18] 

dn 

P 

ID 

Improved  Block  Row*  [22] 

d2n 

P 

min  { log  P,  f  } 

Outer  Product*  [46] 

d2n 

P 

logP 

2D 

SpSUMMA  [18] 

dn 

Vp 

VP 

3D 

Recursive  [4] 

min  {  A?  flog  A]  } 

log  P 

Table  4.1:  Asymptotic  expected  communication  costs  of  sparsity-independent  algorithms  and 
lower  bounds.  Algorithms  marked  with  an  asterisk  make  use  of  all-to-all  communication. 
Depending  on  the  algorithm  used  for  the  all-to-all,  either  the  bandwidth  or  latency  cost  listed 
is  attainable,  but  not  both;  see  Section  4.1.1. 
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4.3.1  ID  Algorithms 

Naive  Block  Row  Algorithm 

The  naive  block  row  algorithm  [18]  distributes  the  input  and  output  matrices  to  processors 
in  a  block  row  fashion.  Then  in  order  for  processor  i  to  compute  the  ith  block  row,  it  needs 
access  to  the  ith  block  row  of  A  (which  it  already  owns),  and  potentially  all  of  B.  Thus,  we 
can  allow  each  processor  to  compute  its  block  row  of  C  by  leaving  A  and  C  stationary  and 
cyclically  shifting  block  rows  of  B  around  a  ring  of  the  processors.  This  algorithm  requires  P 
stages,  with  each  processor  communicating  with  its  two  neighbors  in  the  ring.  The  size  of 
each  message  is  the  number  of  nonzeros  in  a  block  row  of  R,  which  is  expected  to  be  dn/P 
words.  Thus,  the  bandwidth  cost  of  the  block  row  algorithm  is  dn  and  the  latency  cost  is  P. 
An  analogous  block  column  algorithm  works  by  cyclically  shifting  block  columns  of  A  with 
identical  communication  costs. 

Improved  Block  Row  Algorithm 

The  communication  costs  of  the  block  row  algorithm  can  be  reduced  without  changing  the 
assignment  of  matrix  entries  or  voxels  to  processors  [22],  The  key  idea  is  for  each  processor 
to  determine  exactly  which  rows  of  B  it  needs  to  access  in  order  to  perform  its  computations. 
For  example,  if  processor  i  owns  the  ith  block  row  of  A,  Aj,  and  the  jth  subcolumn  of  A* 
contains  no  nonzeros,  then  processor  i  doesn’t  need  to  access  the  j th  row  of  B.  Further,  since 
the  height  of  a  subcolumn  is  n/ P,  the  probability  that  the  subcolumn  is  completely  empty  is 

Pr[nnz(Ai(:,j))  =  0]  =  ^1  -  ~ 

assuming  d  <  P.  In  this  case,  the  expected  number  of  subcolumns  of  Aj  which  have  at  least 
one  nonzero  is  dn/P.  Since  processor  i  needs  to  access  only  those  rows  of  B  which  correspond 
to  nonzero  subcolumns  of  Aj,  and  because  the  expected  number  of  nonzeros  in  each  row  of  B 
is  d,  the  expected  number  of  nonzeros  of  B  that  processor  i  needs  to  access  is  d2n/P. 

Note  that  the  local  memory  of  each  processor  must  be  of  size  12  ( d2n/P )  in  order  to  store 
the  output  matrix  C.  Thus,  it  is  possible  for  each  processor  to  gather  all  of  their  required  rows 
of  B  at  once.  The  improved  algorithm  consists  of  each  processor  determining  which  rows  it 
needs,  requesting  those  rows  from  the  appropriate  processors,  and  then  sending  and  receiving 
approximately  d  rows.  While  this  can  be  implemented  in  various  ways,  the  bandwidth  cost  of 
the  algorithm  is  at  least  12  ( d2n/ P )  and  if  point-to-point  communication  is  used,  the  latency 
cost  is  at  least  12 (min {P,  dn/P}).  The  block  column  algorithm  can  be  improved  in  the  same 
manner. 

Outer  Product  Algorithm 

Another  possible  ID  algorithm  is  to  partition  A  in  block  columns,  and  B  in  block  rows  [46]. 
Without  communication,  each  processor  locally  generates  an  n  x  n  sparse  matrix  of  rank  n/P, 
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and  processors  combine  their  results  to  produce  the  output  C.  Because  each  column  of  A  and 
row  of  B  have  about  d  nonzeros,  the  expected  number  of  nonzeros  in  the  locally  computed 
output  is  d2n/P.  By  deciding  the  distribution  of  C  to  processors  up  front,  each  processor  can 
determine  where  to  send  each  of  its  computed  nonzeros.  The  final  communication  pattern 
is  realized  with  an  all-to-all  collective  in  which  each  processer  sends  and  receives  0(d2n/P) 
words.  Note  that  assuming  A  and  B  are  initially  distributed  to  processors  in  different  ways 
may  be  unrealistic;  however,  no  matter  how  they  are  initially  distributed,  A  and  B  can  be 
transformed  to  block  column  and  row  layouts  with  all-to-all  collectives  for  a  communication 
cost  which  is  dominated  by  the  final  communication  phase. 

To  avoid  the  all-to-all,  it  is  possible  to  compute  the  expected  number  of  blocks  of  the 
output  which  actually  contain  nonzeros;  the  best  distribution  of  C  is  2D,  in  which  case  the 
expected  number  of  blocks  of  C  you  need  to  communicate  is  min{P,  dn/y/P}.  Thus  for 
P  >  ( dn )2/3,  the  outer  product  algorithm  can  have  W  =  0(d2n/P )  and  S  =  0(dn/ \/P). 

4.3.2  2D  Algorithms 

Sparse  SUMMA 

In  the  Sparse  SUMMA  algorithm  [18],  the  brick  of  voxels  assigned  to  a  processor  has  its 
longest  dimension  (of  length  n)  in  the  k  dimension.  For  each  output  entry  of  C  to  which  it 
is  assigned,  the  processor  computes  all  the  nonzero  voxels  which  contribute  to  that  output 
entry.  The  algorithm  has  a  bandwidth  cost  of  0(dn/ \/P)  and  a  latency  cost  of  0(\fP)  [18]. 

Improved  Sparse  SUMMA 

In  order  to  reduce  the  latency  cost  of  Sparse  SUMMA,  each  processor  can  gather  all  the 
necessary  input  data  up  front.  That  is,  each  processor  is  computing  a  product  of  a  block  row 
of  A  with  a  block  column  of  B ,  so  if  it  gathers  all  the  nonzeros  in  those  regions  of  the  input 
matrices,  it  can  compute  its  block  of  C  with  no  more  communication.  Since  every  row  of 
A  and  column  of  B  contain  about  d  nonzeros,  and  the  number  of  rows  of  A  and  columns 
of  B  in  a  block  is  n/y/P,  the  number  of  nonzeros  a  processor  must  gather  is  0(dn/ y/P).  If 
d  >  a/P,  then  the  memory  requirements  for  this  gather  operation  do  not  exceed  the  memory 
requirements  for  storing  the  block  of  the  output  matrix  C,  which  is  0(d2n/ P). 

The  global  communication  pattern  for  each  processor  to  gather  its  necessary  data  consists 
of  allgather  collectives  along  processor  columns  and  along  processor  grids.  The  bandwidth 
cost  of  these  collectives  is  0(dn/\[P ),  which  is  the  same  as  the  standard  algorithm,  and  the 
latency  cost  is  reduced  to  O(logP).  To  our  knowledge,  this  improvement  has  not  appeared 
in  the  literature  before. 

We  might  also  consider  applying  the  optimization  that  improved  the  ID  block  row 
(or  column)  algorithm.  Processor  (i,j)  would  need  to  gather  the  indices  of  the  nonzero 
subcolumns  of  and  the  nonzero  subrows  of  Bj.  This  requires  receiving  Ll(dn/\/P )  words, 
and  so  it  cannot  reduce  the  communication  cost  of  Sparse  SUMMA. 
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As  in  the  dense  case,  there  are  variants  on  the  sparse  SUMMA  algorithm  that  leave  one  of 
the  input  matrices  stationary,  rather  than  leaving  the  output  matrix  C  stationary  [40] .  When 
multiplying  ER(d)  matrices,  stationary  input  matrix  algorithms  require  more  communication 
that  the  standard  approach  because  the  global  data  involved  in  communicating  C  is  about 
d2n,  while  the  global  data  involved  in  communicating  A  and  B  is  only  dn. 

4.3.3  3D  Recursive  Algorithm 

Next  we  adapt  the  BFS/DFS  approach  to  the  sparse  case.  Although  we  have  assumed 
that  the  input  matrices  are  square,  the  recursive  algorithm  will  use  rectangular  matrices  for 
subproblems.  Assume  that  P  processors  are  solving  a  subproblem  of  size  m  x  k  x  m,  that  is 
dismx  k,  and  B  is  k  x  m,  and  C  is  m  x  m.  We  will  split  into  four  subproblems,  and  then 
solve  each  subproblem  independently  on  a  quarter  of  the  processor.  There  are  two  natural 
ways  to  split  the  problem  into  four  equal  subproblems  that  respect  the  density  similarity 
between  A  and  B ,  see  Figure  4.3. 

1.  Split  m  in  half,  creating  four  subproblems  of  shape  (m/2)  x  k  x  (m/2).  In  this  case 
each  of  the  four  subproblems  needs  access  to  a  different  part  of  C ,  so  no  communication 
of  C  is  needed.  However  one  half  of  A  and  B  is  needed  for  each  subproblem,  and  since 
each  quarter  of  the  processors  holds  only  one  quarter  of  each  matrix,  it  will  be  necessary 
to  replicate  A  and  B.  This  can  be  done  via  allgather  collectives  among  disjoint  pairs  of 
processors  at  the  cost  of  O  ( dmk/(nP ))  words  and  0(1)  messages. 

2.  Split  k  in  quarters,  creating  four  subproblems  of  shape  m  x  [kj 4)  x  m.  In  this  case 
each  of  the  four  subproblems  needs  access  to  a  different  part  of  A  and  B,  so  with  the 
right  data  layout,  no  communication  of  A  or  B  is  needed.  However  each  subproblem 
will  compute  nonzeros  across  all  of  C,  so  those  entries  need  to  be  redistributed  and 
combined  if  necessary.  This  can  be  done  via  all-to-all  collective  among  disjoint  sets  of  4 
processors  at  a  cost  of  0(d2m2 / (nP))  words  and  0(1)  messages. 


m 


A  B  C 


m 


k 

1 

m 

2 

3 

4 

k 

2 

m 

1,2, 3, 4 

3 

4 

A  B  C 


Figure  4.3:  Two  ways  to  split  the  matrix  multiplication  into  four  subproblems,  with  the  parts 
of  each  matrix  required  by  each  subproblem  labelled.  On  the  left  is  split  1  and  on  the  right 
is  split  2. 


At  each  recursive  step,  the  algorithm  chooses  whichever  split  is  cheapest  in  terms  of 
communication  cost.  Initially,  m  =  k  =  n  so  split  1  costs  0(dn/P)  words  and  is  cheaper 
than  split  2,  which  costs  0(d2n/P)  words.  There  are  two  cases  to  consider. 
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Case  1:  If  P  <  d2,  the  algorithm  reaches  a  single  processor  before  split  1  becomes  more 
expensive  than  split  2,  so  only  split  1  is  used.  This  case  corresponds  to  a  2D  algorithm,  and 
the  communication  costs  are 


W 


log4  P-1 

£  o 


i= 0 


/  d(n/2l)n\ 

v  m  ) 


S  =  0(logP). 


Case  2:  If  P  >  d2,  split  1  becomes  more  expensive  than  split  2  after  log 2d  steps.  After 
log2  d  steps,  the  subproblems  have  dimensions  (n/d)  x  n  x  (n/d)  and  there  are  P/d 2  processors 
working  on  each  subproblem.  The  first  log2  d  steps  are  split  1,  and  the  rest  are  split  2,  giving 
communication  costs  of 


W 


log2  d—1 

£  ° 


i=0 


/  d{n/2l)n\ 

V  m  ) 


log  4P 

+  £  o 

i=log2  d 


S  =  O(logP). 


This  case  corresponds  to  a  3D  algorithm. 

In  both  cases,  the  communication  costs  match  the  lower  bound  from  Section  4.2  up  to 
factors  of  at  most  logP.  Only  layouts  that  are  compatible  with  the  recursive  structure 
of  the  algorithm  will  allow  these  communication  costs.  One  simple  layout  is  to  have  A  is 
block-column  layout,  B  in  block-row  layout.  Then  C  should  have  blocks  of  size  n/d  x  n/d , 
each  distributed  on  a  different  \P/d?~\  of  the  processors. 

Note  that  since  BFS  only  algorithm  does  not  use  more  than  a  constant  factor  extra 
memory,  there  is  no  need  for  DFS  steps. 


4.4  Performance  Results 

We  have  implemented  the  block  row  algorithm,  the  improved  block  row  algorithm,  the  outer 
product  algorithm,  sparse  SUMMA,  and  the  recursive  algorithm  in  MPI  and  C++.  The  five 
implementations  share  the  same  local  sparse  matrix  multiplication  code.  All  our  experiments 
are  of  Erdos- Renyi  random  matrices  whose  entries  are  random  double  precision  values.  They 
are  stored  in  coordinate  layout,  and  use  either  32-bit.  or  64-bit.  integers  for  indexing,  as 
appropriate  for  the  matrix  dimension. 

We  benchmarked  on  Titan1,  a  Cray  XK7  at  Oak  Ridge  National  Laboratory.  It  consists 
of  18,688  nodes  connected  by  a  Gemini  interconnect.  Each  node  has  32GB  of  RAM,  a  16  core 
AMD  Opt.eron  6274  processor,  and  an  Nvidia  K20  GPU.  As  of  November  2012,  it  ranked 
first  on  the  TOP500  list  [53],  with  a  LINPACK  score  of  17.59  Tflop/s.  In  our  experiments, 
we  only  make  use  of  the  CPUs  and  not.  the  GPUs.  We  use  one  core  per  process  (16  MPI 
processes  per  node). 

In  general,  the  recursive  algorithm  outperforms  all  the  previous  algorithms,  especially 
when  run  at  large  scale.  Among  the  previous  algorithms,  when  d  is  small  relative  to  \[P,  the 


1For  machine  details,  see  http://www.olcf.ornl.gov/titan/ 
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(a)  n  =  226,  d  =  2° 


(b)  n  =  218,  d  =  24 


(c)  n  =  226,  d  =  22  (d)  n  =  218,  d  =  26 


(e)  n  =  226,  d  =  24 

Improved  Row  — • —  Block  Row  — & 


(f)  n  =  218,  d  =  28 

Outer  *  SpSUMMA  — ■ —  Recursive  — * — 


Figure  4.4:  Strong  scaling  of  sparse  matrix  multiplication. 
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outer  product  algorithm  and  the  improved  block  row  algorithm  perform  best;  whereas  when 
d  is  large  relative  to  y/P,  sparse  SUMMA  performs  best.  This  is  as  one  would  expect  from 
the  communication  costs  of  each  algorithm  (see  Table  4.1). 

Figure  4.4  shows  strong  scaling  of  all  five  algorithms  on  a  variety  of  problem  sizes.  In 
the  top  row,  the  output  has  226  expected  nonzeros  (the  actual  number  of  nonzeros  in  the 
experiments  is  within  0.1%  of  the  expected  number),  so  the  input  and  output  can  be  stored  on 
one  node.  When  d  —  1,  on  up  to  256  cores,  the  best  performing  algorithms  are  the  recursive 
algorithm,  the  outer  product  algorithm,  and  the  improved  block  row  algorithm.  Beyond  that 
point,  the  recursive  algorithm  substantially  outperforms  the  other  two.  For  d  —  1,  every  step 
of  the  recursive  algorithm  is  of  type  2,  and  it  is  essentially  the  same  algorithm  as  the  outer 
product  algorithm.  The  difference  in  performance  is  because,  when  using  many  cores,  the 
all-to-all  on  entries  of  C  that  is  performed  by  our  recursive  code  substantially  outperforms 
the  Alltoallv  function  of  MPI.  In  this  case,  the  recursive  algorithm  has  a  best  time  to  solution 
of  0.0303  seconds,  which  makes  it  11  x  faster  than  the  next  best  algorithm:  improved  block 
row  algorithm.  When  d  =  16,  sparse  SUMMA  is  the  closest  competitor,  and  the  recursive 
algorithm  is  about  4x  faster. 

In  the  second  row  of  Figure  4.4,  the  output  has  230  expected  nonzeros.  When  d  =  4,  the 
recursive  algorithm’s  minimum  time  to  solution  is  about  8. lx  better  than  the  improved  block 
row  algorithm,  which  is  the  closest  competitor.  When  d  =  16,  sparse  SUMMA  is  the  closest 
competitor,  and  the  recursive  algorithm  is  3.9  x  faster. 

In  the  third  row  of  Figure  4.4,  the  output  has  234  expected  nonzeros.  When  d  =  16,  the 
recursive  algorithm  beats  all  the  other  algorithms  by  at  least  6x.  Note  that  among  the 
other  algorithms,  the  best  performance  is  by  the  improved  block  row  algorithm  using  4096 
cores,  which  is  near  the  minimum  to  hold  the  input  and  output;  only  the  recursive  algorithm 
was  able  to  use  more  parallelism  to  run  faster  than  this.  When  d  =  256,  sparse  SUMMA  is 
again  the  closest  competitor,  and  the  recursive  algorithm  is  4.7x  faster.  Only  the  recursive 
algorithm  shows  significant  performance  gains  when  going  from  16k  cores  to  64k  cores. 

Figure  4.5  shows  a  breakdown  of  the  time  between  interprocessor  communication  (MPI 
calls),  local  computation  (everything  else),  and  time  imbalance  between  the  processors  using 
4096  cores  with  230  nonzeros  in  the  output.  All  of  the  previous  algorithms  are  communication 
bound,  whereas  the  recursive  algorithm  is  able  to  reduce  the  communication  to  be  faster  than 
the  local  computation  (which  is  expected  to  itself  be  communication  bound  in  the  memory 
hierarchy).  Load  imbalance  is  not  a  significant  problem  for  any  of  the  algorithms. 

In  Figure  4.6,  we  explore  the  possibilities  for  interleaving  type  1  and  type  2  recursive 
steps  in  the  recursive  algorithm  on  P  =  1024  cores.  Our  implementation  allows  for  arbitrary 
interleavings,  which  makes  for  2log4P  =  yfP  possible  executions.  For  large  values  of  P,  this 
is  too  many  to  realistically  explore,  so  in  Figures  4.4  and  4.5  we  only  explored  “simple” 
interleavings:  those  for  which  all  of  the  type  1  steps  are  taken  before  any  of  the  type  2  steps. 
These  are  shown  in  green  in  Figure  4.6.  In  red  in  Figure  4.6  is  the  case  of  log2  d  —  1  type 
1  steps  followed  by  log4  P  —  log2  d  +  1  type  2  steps.  Taking  into  account  the  fact  that  two 
matrices  must  be  communicated  at  a  type  1  step  ( A  and  B),  whereas  only  one  must  be 
communicated  at  a  type  2  step  (C),  this  is  what  the  analysis  of  Section  4.3.3  suggests  will 
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Figure  4.5:  Time  breakdown  on  4096  cores.  In  each  case,  the  expected  number  of  nonzeros 
in  the  output  is  d2n  =  230.  For  n  =  230,  d  =  1,  the  block  row  algorithm  spent  39.3  seconds 
on  communication,  11.9  seconds  on  local  computation,  and  0.985  seconds  were  due  to  load 
imbalance,  but  the  plot  is  cut  off  at  12  seconds. 


perform  best.  In  three  of  the  four  cases  this  is  the  fastest  option,  and  in  the  fourth  it  is  only 
8.3%  slower  than  the  best  interleaving.  This  suggests  that  following  that  rule  will  give  very 
close  to  optimal  performance,  and  tuning  is  probably  not  necessary. 
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(a)  n  =  226,  d  =  2° 


(b)  n  =  222,  d  =  22 


(c)  n  =  218,  d  =  24 


(d)  n  =  214,  d  =  26 


Figure  4.6:  Possible  interleavings  of  the  recursive  algorithm  on  1024  processors.  The  red 
bar  corresponds  to  taking  log2  d  —  1  steps  of  type  1  followed  by  the  remaining  steps  of  type 
2.  Green  bars  correspond  to  taking  all  type  1  steps  before  the  first  type  2  step.  Blue  bars 
correspond  to  other  interleavings. 
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Chapter  5 

Beyond  Matrix  Multiplication 


In  this  chapter  we  show  how  to  use  the  BFS/DFS  approach  to  parallelize  various  recursive 
algorithms  beyond  matrix  multiplication.  We  start  with  naive  n-body  interaction,  for  which 
the  recursive  calls  are  independent.  We  then  examine  several  algorithms  with  dependencies 
between  the  recursive  calls. 


5.1  Naive  n-body  Interaction 

For  an  n-body  simulation,  at  each  timestep  we  want  to  compute  the  force  on  each  particle 
due  to  each  other  particle.  This  can  be  computed  using  the  naive  iterative  algorithm  as: 
for  i  in  1  :  n  do 
for  j  in  1  :  n  do 
Fi  =  force  (i,j) 

where  force(i,  j)  computes  the  force  on  particle  i  due  to  particle  j. 

To  recast  this  as  a  recursive  algorithm,  generalize  it  slightly  to  computing  the  forces  on 
n  particle  in  a  list  A  due  to  n  particles  in  a  (possibly  different)  list  B.  Then  the  recursive 
algorithm  is: 

function  RECNBODY(z4,R,n) 
if  n  =  1  then 

force(R(l),  B(  1)) 
else 

RecNbody(R(l  :  n/2),R(l  :  n/2)) 

RecNbody(R(l  :  n/2),  B(n/2  +  1  :  n)) 

RecNbody(R(n/2  +  1  :  n),B(  1  :  n/2)) 

RecNbody(R(n/2  +  1  :  n),  5  (n/2  +  1  :  n)) 

The  function  makes  four  recursive  calls  to  itself.  The  four  calls  are  not  quite  independent: 
two  of  them  update  each  particle  in  A.  If  we  instead  make  two  copies  A  for  the  two  different 
calls,  and  then  combine  the  contributions  at  the  end,  they  become  independent  at  the  cost 
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of  0(n)  work  after  the  recursive  calls.  The  parallelization  is  now  very  similar  to  the  case  of 
square  matrix  multiplication. 


5.1.1  Unlimited  Memory 

Assume,  for  now,  that  memory  is  not  a  limiting  factor.  Then  we  can  perform  the  calculation 
by  doing  k  BFS  steps,  followed  by  local  computation.  For  each  BFS  step,  the  communication 
consists  of  3  all-to-all’s  between  disjoint  sets  of  4  processors:  one  to  redistribute  A,  one  to 
redistribute  B,  and  one  to  collect  the  forces.  Assume  that  a  single  particle  or  force  can  be 
represented  by  0(1)  words.  The  number  of  words  in  each  message  is  all  the  data  a  processor 
has  about  half  of  the  particles  (or  forces).  Thus  at  each  step,  each  processor  sends  and 
receives  0(1)  messages  of  length  0{n/P).  The  recurrences  for  bandwidth  and  latency  are 
thus 

W(n,P)  =  W (n/2,  P/4)  +  0(n/P)  =  0(n/VP), 

S(n,P )  =  S(n/2,  P/4)  +  0(1)  =  O(logP). 

The  memory  usage  is  a  geometric  sum,  since  each  BFS  step  needs  twice  as  much  memory  as 
the  previous,  and  is  dominated  by  the  last  term: 

M(n,P)  =  0(2kn/P)  =  0{n/\fP). 


5.1.2  Limited  Memory 

If  there  is  not  enough  memory  to  perform  the  unlimited  memory  scheme  above,  it  is  possible 
to  perform  DFS  steps,  each  of  which  reduces  the  future  memory  requirements  by  a  factor  of 
2.  If  there  is  M  local  memory  available,  take 


£  =  log2 


n 

M  \fP 


+  c, 


for  some  constant  c,  DFS  steps  followed  by  k  BFS  steps.  This  fits  inside  the  available  local 
memory  M,  and  consists  of  Y  calls  to  the  unlimited  memory  case  with  size  n/2  .  Thus  its 
costs  are 

Tp)  =0(£)' 

S(n,  P )  =  4'0(log  P)  =  O  (XL  logpj  . 

These  asymptotically  match  the  communication  costs  of  the  1.5D  iterative  algorithm  and 
the  lower  bounds  presented  in  [33]. 
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5.2  Dealing  with  Dependencies 

So  far,  all  of  the  algorithm  we  have  considered  have  had  independent  recursive  calls,  possibly 
with  some  work  at  the  beginning  and  end.  For  many  recursive  algorithms,  however,  this  is 
not  the  case.  To  obtain  communication-efficient  parallelizations  in  such  cases,  we  modify 
BFS/DFS  approach  by  allowing  the  algorithm  to  discard  some  fraction  of  the  processors  at 
each  recursive  step.  The  reason  is  to  avoid  having  to  solve  many  small  base  cases  on  all  P 
processors,  which  would  incur  a  high  latency  cost.  Discarding  processors  will  always  increase 
the  arithmetic  and  bandwidth  cost,  but  if  we  do  not  discard  too  many  at  each  recursive  step, 
we  can  keep  these  from  increasing  by  more  than  constant  factors. 


5.3  All-Pairs  Shortest  Paths 


Consider  the  Floyd- Warshall  algorithm  for  computing  all-pairs  shortest  paths  on  a  graph 
(the  minimum  distance  between  every  pair  of  vertices  in  a  graph).  We  can  write  the  problem 
using  three  nested  loops  so  that  it  looks  like  matrix  multiplication,  except  with  addition  and 
multiplication  replaced  by  taking  minima  and  addition,  respectively.  Additionally,  unlike 
matrix  multiplication,  there  are  dependencies  between  the  iterations  in  the  inner  loop,  so 
they  cannot  be  arbitrarily  re-ordered  and  parallelized. 

Like  matrix  multiplication,  the  Floyd- Warshall  algorithm  can  be  recast  as  a  recursive 
algorithm,  where  each  index  is  split  in  half  and  there  are  8  subproblems  to  solve  [55].  Because 
of  the  dependencies,  all  8  subproblems  must  be  solved  in  order.  Fortunately,  however,  6  of 
the  8  subproblems  have  no  internal  dependencies,  and  thus  have  exactly  the  same  asymptotic 
costs  as  matrix  multiplication,  while  the  remaining  2  have  the  same  dependencies  as  the 
original  problem. 

Consider  starting  with  the  problem  of  size  n  on  P  processors,  using  only  P/a  of 
those  processors  to  solve  the  two  subproblems.  The  cost  recurrence  is  then  FW(n,  P )  = 
2FW(n/2,  P/a )  +  6GEMM(n/2,  P )  +  fin2 / P  +  a.  Here  FW(n,  P )  is  the  cost  (arithmetic,  bandi- 
wdith,  or  latency)  of  the  recursive  Floyd- Warshall  algorithm,  GEMM(n,  P)  is  the  cost  of  square 
n  x  n  matrix  multiplication  on  P  processors,  and  fin2 / P  +  a  is  the  communication  cost  of 
redistributing  the  input  and  output  at  each  recursive  step.  Assuming  a  >  1,  the  base  case 
computation  is  performed  on  one  processor,  without  incurring  any  communication  costs,  after 
loga  P  recursive  steps.  In  terms  of  flops,  this  gives 


F(n,  P )  =  2 F 


n  P\ 
2  ’~a) 


lQg  aP 

No 


i= 1 


O  A)  a  <4 
o{p^)  a  >4 


Thus  as  long  as  a  <  4  (at  least  1/4  of  the  processors  are  kept  at  each  recursive  step),  the 
computation  is  asymptotically  load  balanced,  although  it  will  suffer  a  constant  factor  increase 
in  the  ffop  cost.  Taking  a  >  4  turns  out  to  be  asymptotically  equivalent  to  discarding  some 
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of  the  processors  at  the  beginning  and  then  using  a  smaller  value  of  a,  so  we  do  not  consider 
that  case  further. 

In  terms  of  bandwidth  cost,  it  gives 


W(n,  P)  =  2 W 


n  P\ 
2  l~a) 


+  0 


n3  \ 

pVm) 


0 

0 


P  2/3 
,.2 


+ 


PsfM 

+ 


Plo«a  2  _r  Ps/m) 


a  <  23/2 
23/2  <  a  <  4 


The  communication  lower  bounds  of  [10,  5]  apply  to  Floyd- Warshall,  so  the  recursive  algorithm 
is  bandwidth-optimal  for  small  values  of  a.  The  value  of  a  between  a  =  23/2  and  a  =  4  at 
which  it  becomes  not  bandwidth-optimal  depends  on  how  much  memory  is  available. 

The  latency  cost  is 


S{n,  P)  =  2S 


f  n  P 
~a 


+  0 


(1  +  pSw) 
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^ pl°ga  2  _|_ 


n3  \ 
PM3/2  ) 


as  long  as  a  <  4.  Increasing  a  decreases  the  latency  cost,  because  it  avoids  having  many 
subproblems  to  solve,  each  using  many  processors.  Because  of  the  dependencies,  the  latency 
is  always  higher  than  log  P,  unlike  for  matrix  multiplication.  For  a  >  23/2  (assuming  there  is 
enough  memory),  the  latency  and  bandwidth  match  the  conjectured  tradeoff  lower  bound  of 
[59] .  The  asymptotic  costs  of  their  iterative  algorithm  are  very  similar  to  the  costs  of  the 
recursive  algorithm  shown  above:  their  case  of  c  =  1  corresponds  here  to  a  =  4,  and  their 
case  of  c  =  P1/3  corresponds  to  a  =  23/2.  A  similar  algorithm  was  also  presented  in  [63].  The 
recursive  algorithm  requires  no  more  than  a  constant  factor  extra  memory  for  any  a  <  4, 
although  if  more  is  available  the  subproblems  that  look  like  matrix  multiplication  benefit 
from  it. 


5.4  Triangular  Solve  with  Multiple  Righthand  Sides 


Next,  consider  solving  a  triangular  system  LX  =  Y,  where  L  is  an  n  X  n  lower  triangular 
matrix,  and  X  and  Y  are  n  x  m  matrices;  the  inputs  are  L  and  Y  and  the  output  is  X. 

First  consider  the  square  case  that  m  =  n.  A  recursive  algorithm  is  given  in  [8]  that 
splits  both  dimensions  of  all  three  matrices  in  half,  and  has  four  recursive  calls  to  triangular 
solves,  and  two  matrix  multiplications.  The  critical  path  consists  of  one  triangular  solve, 
then  one  matrix  multiplication,  then  another  triangular  solve;  meanwhile  the  other  half  of 
the  computation  can  be  done  in  parallel.  This  leads  to  a  cost  recurrence  of  TRSM(n,  n,  P)  = 
2TRSM(n/2,  nj 2,  Pj (2a))  +  GEMM(n/2,  P/2)  +  (3n2 / P  +  a.  The  base  case  is  local  computation 
on  one  processor  after  log2a  P  recursive  steps.  As  in  the  case  of  all  pairs  shortest  path,  we 
analyze  the  flop,  bandwidth,  and  latency  costs  as  a  function  of  a.  The  flop  cost  is 


5 
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as  long  as  a  <  2.  The  bandwidth  cost  is 


W{n,P)  =  2W{,lLyo[Nr3+JN 
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The  latency  cost  is 


n  P\  „  /  _  u" 
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as  long  as  a  <  2.  Increasing  a  decreases  latency  cost  at  the  expense  of  higher  bandwidth  and 
flop  cost.  Since  we  are  limited  to  a  <  2,  the  latency  cost  cannot  be  reduced  below  0(P 1/2). 
For  small  values  of  a,  the  bandwidth  cost  asymptotically  matches  the  lower  bounds  of  [10,  5]. 

We  can  improve  the  latency  cost,  even  for  the  square  case,  by  allowing  the  two  dimensions 
to  be  split  separately.  In  [11],  it  is  shown  how  to  break  the  recursive  algorithm  of  [8]  into  two 
recursive  steps.  One  step  splits  m,  creating  two  triangular  solves  on  matrices  of  half  as  many 
columns  that  can  be  performed  in  parallel.  The  other  step  splits  n,  creating  two  triangular 
solves  with  half  as  many  rows,  and  a  matrix  multiplication,  all  three  of  which  must,  be 
performed  in  order.  Consider  an  algorithm  that  combines  one  step  of  splitting  n  with  k  steps 
of  splitting  m  (so  the  algorithm  considered  above  is  the  case  k  =  1).  This  leads  to  a  recurrence 
of  TRSM(n,  m,  P)  =  2TRSM(n/2,  m/2k,  Pj (2fca))  +  0(GEMM(n,  n,  m,  P))  +  (run  +  n2) / P f3 ,  where 
the  last  term  is  the  bandwidth  cost  of  redistribution  at  each  step.  The  flop  cost  is 
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as  long  as  a  <  2.  The  bandwidth  cost  due  to  the  matrix  multiplications  is 
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The  bandwidth  cost  due  to  redistribution  is 
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The  total  bandwidth  cost  is  the  sum  of  these  two  contributions.  The  latency  cost  is 
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as  long  as  a  <  2.  In  general,  one  should  choose  k  and  a  so  that  Wr  =  Wmm ;  there  will  still 
be  a  remaining  degree  of  freedom,  and  increasing  k  or  a  will  decrease  latency  at  the  expense 
of  higher  bandwidth. 

A  simple  example  is  the  square  case  m  =  n,  with  a  —  1  and  k  >  3/2.  Then  the  above 
costs  simplify  to 
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Note  that  by  taking  large  values  of  k,  the  latency  is  substantially  reduced  relative  to  the 
previous  algorithm,  at  to  cost  of  further  increase  to  the  bandwidth  term.  It  is  only  possible 
to  run  this  algorithm  if  m  >  P,  since  with  a  =  1  the  only  reduction  in  number  of  processors 
comes  from  splitting  m.  If  we  instead  take  y/2  <  a  <  2  and  k  >  5/2  and  define  b  =  k  +  log2  a, 
the  asymptotic  costs  are 
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and  the  algorithm  works  as  long  as  m  >  pfc/(fc+lo§2 a) . 

The  latency  costs  are  substantially  higher  than  the  lower  bounds  of  [10].  We  conjecture 
that  taking  into  account  dependencies  will  give  tighter  lower  bounds. 


5.5  Cholesky  Decomposition 


Sequential  Cholesky  decomposition  of  an  nx  n  matrix  can  be  done  recursively  by  making  two 
calls  to  Cholesky  decomposition  on  an  |  x  |  matrix,  one  call  to  a  square  triangular  solve,  and 
one  call  to  matrix  multiplication  [3,  8].  All  four  of  those  calls  must  be  done  in  order.  Thus 
our  parallelization  gives  the  recurrence  CH0L(n,  P)  =  2CH0L(n/2,  P/a )  +  0(GEMM(n,  n,  n,  P)  + 
TRSM(n,  n,  P)  +n2/Pf3).  For  the  costs  of  the  triangular  solve,  use  the  k  =  3/2,  a  =  1  algorithm 
from  the  previous  section;  further  reductions  to  the  latency  cost  of  the  triangular  solve  do 
not  reduce  the  latency  cost  of  the  Cholesky  decomposition. 

In  terms  of  flops,  we  have 
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for  a  <  4.  The  bandwidth  cost  is 
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as  long  as  a  <  4.  Note  that  increasing  a  from  1  to  23/2  decreases  the  latency  cost  without  any 
asymptotic  affect  on  the  bandwidth  cost.  The  interesting  range  is  23//2  <  a  <  4.  for  which  the 
costs  are 


F 


W  =  0 


n 


P loga  2 


n3  \ 

pVm)  ’ 


S 


O 


^ plos0  2  _|_ 


n3  \ 
PM3/2  )  ' 


This  algorithm  is  similar  to  the  pivot-free  generic  Gaussian  elimination  algorithm  of  [64], 
although  that  algorithm  requires  as  many  all-to-all  communication  phases  as  ours  requires 
messages,  and  so  it  may  be  less  efficient  in  practice.  The  costs  are  also  equivalent  to  the 
iterative  2.5D  Cholesky  algorithm  [37];  c  =  1  there  corresponds  to  a  =  4,  and  c  =  P1/3 
corresponds  to  a  =  23/2. 
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Chapter  6 

Discussion  and  Open  Questions 


The  original  impetus  for  this  work  was  the  search  for  a  communication-optimal  parallel 
algorithm  for  Strassen’s  matrix  multiplication.  Following  the  communication-cost  lower 
bounds  of  [9],  we  were  unable  to  find  a  matching  algorithm  in  the  parallel  case  in  the 
literature  (although  we  later  found  an  algorithm  quite  similar  to  our  CAPS  algorithm  in  [52]). 
Since,  unlike  most  linear  algebra  algorithms,  Strassen’s  algorithm  has  no  simple  iterative 
definition,  we  found  a  method  of  parallelization  that  respects  the  recursive  structure  of  the 
algorithm:  by  traversing  the  recursion  tree  using  BFS  and  DFS  steps. 

We  call  this  algorithm  Communication- Avoiding  Parallel  St.rassen  (or  CAPS),  which 
appears  in  Chapter  2,  but  the  idea  easily  applies  more  broadly  than  just  to  Strassen’s 
algorithm.  Many  other  fast  matrix  multiplication  algorithms,  including  those  with  exponents 
much  smaller  than  Strassen’s,  have  similar  recursive  formulations.  However  several  practical 
issues  prevent  their  use,  and  we  are  not  aware  of  any  practical  algorithm  with  a  substantially 
better  exponent  than  Strassen’s  algorithm. 

Question  1.  Is  there  a  matrix  multiplication  algorithm  that  is  faster  than  Strassen’s  algorithm 
in  practice?  If  so,  it  may  have  a  much  higher  branching  factor  than  Strassen’s  algorithm  (for 
example,  the  algorithms  in  [47]  have  thousands  of  recursive  calls  or  more  per  step).  Would 
the  BFS/DFS  approach  work  well  in  practice  given  the  high  branching  factor? 

One  other  practical  example  (though  with  higher  exponent  than  Strassen’s  algorithm)  is  the 
recursive  formulation  of  classical  matrix  multiplication.  Splitting  each  of  the  three  dimensions 
in  half  gives  8  subproblems  instead  of  the  7  subproblems  in  the  case  of  Strassen’s  algorithm. 
Applying  the  BFS/DFS  approach  to  this  algorithm  gives  a  recursive  communication-optimal 
parallel  algorithm  with  asymptotically  identical  costs  to  the  2.5D  algorithm  [52,  61]. 

In  fact,  the  classical  algorithm  has  more  implementation  freedom  than  Strassen’s  algorithm. 
Instead  of  splitting  all  three  dimensions  at  once,  one  may  split  only  one  at  a  time.  For 
rectangular  matrices,  it  is  natural  to  expect  that  splitting  the  largest  dimension,  to  create 
closer  to  square  subproblems,  would  save  on  communication  (as  in  the  sequential  case 
[34]).  When  we  first  applied  the  BFS/DFS  approach  to  this  idea,  it  did  indeed  outperform 
previous  algorithms  for  rectangular  matrices.  But  when  we  proved  lower  bounds  for  classical 
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rectangular  matrix  multiplication,  in  some  cases  they  were  lower  than  the  costs  of  our  initial 
algorithm.  To  match  the  lower  bounds,  we  realized  that,  with  the  right  data  layout,  only  the 
smallest  matrix  needs  to  be  redistributed  at  each  recursive  step.  This  improvement  gives 
the  CARMA  algorithm  as  presented  in  Chapter  3.  Unfortunately,  to  communicate  only  one 
matrix  at  each  recursive  step  seems  to  require  that  the  initial  data  layout  of  a  matrix  depends 
on  the  other  input  matrix.  There  does  not  seem  to  be  one  standard  layout  that  works  well  in 
all  cases;  rather  the  best  layout  depends  on  the  context. 

Question  2.  Can  a  library  be  developed  that  reaches  the  communication  lower  bounds  of 
Section  3. 1  for  any  sequence  of  matrix  multiplications,  possibly  by  changing  data  layouts  when 
appropriate? 

When  input  matrices  are  sparse,  the  problem  becomes  more  complicated.  Even  for  the 
square  case  that  we  analyzed  in  Chapter  4,  the  number  of  nonzeros  in  the  input  and  output 
matrices  may  be  very  different.  Thus,  as  in  the  rectangular  case,  a  natural  algorithm  is  to 
communicate  the  smallest  matrix  (measured  by  number  of  nonzeros)  and  split  the  other 
dimension  at  each  recursive  step.  For  matrices  with  non-uniform  distribution,  the  correct 
split  may  not  be  in  the  middle,  and  the  split  that  balances  data  size  may  be  different  from 
the  one  that  balances  work.  For  the  special  case  of  square,  random  matrices  with  sparse 
output  we  showed  that  this  algorithm  is  communication-optimal  among  sparsity-independent 
algorithms. 

Question  3.  Using  an  efficient  way  to  estimate  the  number  of  nonzeros  in  the  output  matrix, 
one  could  generalize  the  recursive  algorithm  to  arbitrary  sparse  matrices.  What  can  be  proved 
about  the  communication  costs  of  such  an  algorithm? 

Moving  beyond  matrix  multiplication,  many  recursive  algorithms  have  dependencies 
between  the  recursive  calls.  We  analyzed  several  of  these  algorithms  in  Chapter  5  (all  pairs 
shortest  path,  triangular  solve  with  multiple  righthand  sides,  and  Cholesky  decomposition), 
and  presented  bandwidth  optimal  algorithms  based  on  the  BFS/DFS  approach.  However  the 
algorithms  with  dependencies  do  not  match  existing  latency  cost  lower  bounds.  There  are 
several  open  questions: 

Question  4.  Can  stronger  latency  lower  bounds  be  proved  for  algorithms  with  dependencies? 
In  other  words,  is  the  bandwidth-latency  tradeoff  exhibited  by  the  best  existing  algorithms 
necessary? 

Question  5.  Would  implementations  of  the  algorithms  in  Chapter  5  perform  well  in  practice? 
In  particular,  is  the  technique  of  discarding  processors  at  each  recursive  step,  to  control  latency 
costs,  effective? 

Question  6.  Among  the  class  of  cache- oblivious  sequential  communication- optimal  algorithms, 
which  ones  give  communication- optimal  parallel  algorithms  using  the  BFS/DFS  approach? 
To  what  extent  can  the  parallelization  of  recursive  algorithms  be  automated? 
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